WL-TR-93-1146


AFOSR/NL 

110  DUNCAN  AVENUE,  SUITE  100 
WASHINGTON  DC  20332-6448 


NOVEMBER  1993 


APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  IS  UNLIMITED. 




AVIONICS  DIRECTORATE 

WRIGHT  LABORATORY 

AIR  FORCE  MATERIEL  COMMAND 

WRIGHT  PATTERSON  AFB  OH  45433-7409 




NOTICE 


When  Government  drawings,  specifications,  or  other  data  are  used  for 
any  purpose  other  than  in  connection  with  a  definitely  Government-related 
procurement,  the  United  States  Government  incurs  no  responsibility  or  any 
obligation  whatsoever.  The  fact  that  the  government  may  have  formulated  or 
in  any  way  supplied  the  said  drawings,  specifications,  or  other  data,  is  not 
to  be  regarded  by  implication,  or  otherwise  in  any  manner  construed,  as 
licensing  the  holder,  or  any  other  person  or  corporation;  or  as  conveying 
any  rights  or  permission  to  manufacture,  use,  or  sell  any  patented  invention 
that  may  in  any  way  be  related  thereto. 


This  report  is  releasable  to  the  National  Technical  Information  Service 
(NTIS).  At  NTIS,  it  will  be  available  to  the  general  public,  including 
foreign  nations. 


This technical report has been reviewed and is approved for publication.

Leemon C. Baird III
Advanced Systems Research Section
System Avionics Division

WILLIAM  R.  BAKER,  Acting  Chief 
Advanced  Systems  Research  Section 
System  Avionics  Division 



CHARLES  H.  KRUEGER,  Chief 
System  Avionics  Division 
Avionics  Directorate 


If  your  address  has  changed,  if  you  wish  to  be  removed  from  our  mailing 
list,  or  if  the  addressee  is  no  longer  employed  by  your  organization  please 
notify WL/AAAT, WPAFB, OH 45433-7301 to help us maintain a current
mailing  list. 


Copies  of  this  report  should  not  be  returned  unless  return  is  required  by 
security  considerations,  contractual  obligations,  or  notice  on  a  specific 
document. 


REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB  No.  0704-0188 


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this
collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.


1.  AGENCY  USE  ONLY  (Leave  blank) 


4.  TITLE  AND  SUBTITLE 
Advantage  Updating 


6.  AUTHOR(S) 

Capt  Leemon  C.  Baird  III,  USAF 


3. REPORT TYPE AND DATES COVERED

Final 


5.  FUNDING  NUMBERS 

PE  61 
PR  2312 
TA  R1 
WU  02 




7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
WL/AAAT-1  Advanced  Systems  Research  Section 
Bldg  635,  2185  Avionics  Circle 
Wright-Patterson  AFB,  OH  45433-7301 


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
AFOSR/NL 

110  Duncan  Avenue,  Suite  100 
Washington,  D.C.  20332-6448 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


WL-TR-93-1146 


10.  SPONSORING  /  MONITORING 
AGENCY  REPORT  NUMBER 


12a.  DISTRIBUTION /AVAILABILITY  STATEMENT 


Approved  for  Public  Release;  distribution  is  unlimited. 


13.  ABSTRACT  (Maximum  200  words) 

A new algorithm for reinforcement learning, advantage updating, is proposed.
Advantage updating is a direct learning technique; it does not require a model to be
given or learned. It is incremental, requiring only a constant amount of calculation per
time step, independent of the number of possible actions, possible outcomes from a given
action, or number of states. Analysis and simulation indicate that advantage updating is
applicable to reinforcement learning systems working in continuous time (or discrete
time with small time steps) for which Q-learning is not applicable. Simulation results
are presented indicating that for a simple linear quadratic regulator (LQR) problem with
no noise and large time steps, advantage updating learns slightly faster than Q-learning.
When there is noise or small time steps, advantage updating learns more quickly than
Q-learning by a factor of more than 100,000. Convergence properties and implementation
issues are discussed. New convergence results are presented for R-learning and
algorithms based upon change in value. It is proved that the learning rule for
advantage updating converges to the optimal policy with probability one.


14.  SUBJECT  TERMS 

reinforcement learning, dynamic programming, Q-learning,
advantage  updating,  continuous  time 


17.  SECURITY  CLASSIFICATION 
OF  REPORT 

UNCLASSIFIED 


18.  SECURITY  CLASSIFICATION 
OF  THIS  PAGE 
UNCLASSIFIED 


19.  SECURITY  CLASSIFICATION 
OF  ABSTRACT 
UNCLASSIFIED 


15.  NUMBER  OF  PAGES 
41  pages 


16.  PRICE  CODE 


20.  LIMITATION  OF  ABSTRACT 


NSN  7540-01-280-5500 




Standard  Form  298  (Rev.  2-89) 

Prescribed  by  ANSI  Std  239-18 
298-102 


Table of Contents


List of Figures
List of Tables
Acknowledgments
1. Introduction
2. Reinforcement Learning Systems
3. The Advantage Updating Algorithm
4. A Linear Quadratic Regulator Problem
5. Q-learning With Small Time Steps
6. Simulation Results
7. Convergence of Advantage Updating
8. Implementation Issues
9. Conclusion
10. References
Appendix A: Notation
Appendix B: LQR constants


List of Figures

Figure 1. Counterexample for R-learning
Figure 2. Counterexample for change in value
Figure 3. Optimal LQR trajectories
Figure 4. Optimal LQR functions for Q-learning
Figure 5. Optimal LQR functions for advantage updating
Figure 6. Time steps required for learning as a function of noise
Figure 7. Time steps required for learning as a function of time step duration, Δt


List of Tables

Table 1. Comparison of several algorithms
Table 2. Learning rate constants and number of time steps required for learning
Table 3. Optimal learning rate constants and number of time steps required for learning




ACKNOWLEDGMENTS 


This  research  was  supported  under  Task  2312R1  by  the  Life  and  Environmental  Sciences  Directorate  of  the 
United  States  Air  Force  Office  of  Scientific  Research.  The  author  gratefully  acknowledges  the  contributions  of 
Harry Klopf, Mance Harmon, Jim Morgan, Gabor Bartha, Scott Weaver, and Tommi Jaakkola.




1.  Introduction 


This  report  provides  an  overview  of  reinforcement  learning,  proposes  a  new  algorithm  for  reinforcement 
learning  in  continuous  time,  and  gives  simulation  results.  Section  2  provides  background  information  on 
Markov  processes,  various  reinforcement  learning  algorithms,  and  the  notation  used  throughout  this  paper.  This 
section  can  be  skipped  by  readers  familiar  with  reinforcement  learning.  Section  3  proposes  the  advantage 
updating  algorithm,  both  for  discrete  time,  and  in  the  limit  for  continuous  time.  Section  4  presents  the  linear 
quadratic regulator (LQR) problem that was used as a testbed for comparing advantage updating to Q-learning.
Section 5 analyzes why advantage updating would be expected to learn more quickly than Q-learning, and
section 6 gives simulation results consistent with this analysis. Section 7 discusses convergence of advantage
updating, and section 8 discusses implementation issues.


2. Reinforcement Learning Systems


A  reinforcement  learning  system  typically  uses  a  set  of  real-valued  parameters  to  store  the  information  that  is 
learned.  When  a  parameter  is  updated  during  learning,  the  notation: 

$$W \leftarrow K \tag{1}$$

represents the operation of instantaneously changing the parameter $W$ so that its new value is $K$, whereas:

$$W \xleftarrow{\alpha} K \tag{2}$$

represents the operation of moving the value of $W$ toward $K$. This is equivalent to:

$$W \leftarrow (1-\alpha)\,W + \alpha K \tag{3}$$

where the learning rate $\alpha$ is a small positive number. Appendix A summarizes this and other notation
conventions. 
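
As a small illustration of the "move toward" notation, the sketch below applies update (3) repeatedly to a single scalar parameter. The function name `move_toward` and the particular numbers are assumptions chosen for the example; they do not appear in the report.

```python
def move_toward(w, target, alpha):
    """Update (3): return (1 - alpha) * w + alpha * target."""
    return (1.0 - alpha) * w + alpha * target


# Repeatedly move a parameter that starts at 0 toward the target K = 1.
w = 0.0
for _ in range(50):
    w = move_toward(w, target=1.0, alpha=0.1)

# Each application shrinks the remaining gap by a factor of (1 - alpha).
gap = 1.0 - w
```

With a learning rate of 1 the update reduces to the instantaneous assignment of update (1), which is why the two notations differ only in the symbol above the arrow.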

A  Markov  sequential  decision  process  (MDP)  is  a  system  that  changes  its  state  as  a  function  of  its  current 
state  and  inputs  received  from  a  controller.  The  set  of  possible  states  for  a  given  MDP,  and  the  set  of  possible 
actions from which the controller can choose, may each be finite or infinite. At time $t$, the controller chooses an
action $u_t$, based upon the state of the MDP, $x_t$. The MDP then transitions to a new state $x_{t+\Delta t}$, where $\Delta t$ is the
duration of a time step. The state transition may be stochastic, but the probability $P(u_t, x_t, x_{t+\Delta t})$ of transitioning
from state $x_t$ to state $x_{t+\Delta t}$ after performing action $u_t$ is a function of only $x_t$, $x_{t+\Delta t}$, and $u_t$, and is not affected by
previous states or actions. If there is a finite set of possible states and actions, then $P(u_t, x_t, x_{t+\Delta t})$ is a
probability. If there is a continuum of possible states or actions, then $P(u_t, x_t, x_{t+\Delta t})$ is a probability density
function (PDF). If time is continuous rather than discrete, then $P(u_t, x_t, \dot{x}_t)$ is the probability that action $u_t$ will
cause the rate of change of the state to be $\dot{x}_t$. The MDP also sends the controller a scalar value known as
reinforcement. If time is discrete, then the total reinforcement received by the controller during time step $t$ is
$R_{\Delta t}(x_t, u_t)$. If time is continuous, then the rate of flow of reinforcement at time $t$ is $r(x_t, u_t)$.

A reinforcement learning problem is the problem of determining which action is best in each state in order
to  maximize  some  function  of  the  reinforcement.  The  most  common  reinforcement  learning  problem  is  the 
problem  of  finding  actions  that  maximize  the  expected  total  discounted  reinforcement,  which  for  continuous 
time  is  defined  as: 

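$$\left\langle \int_0^\infty \gamma^{t}\, r(x_t,u_t)\,dt \right\rangle \tag{4}$$

where $\langle \cdot \rangle$ denotes expected value, and where $0 < \gamma < 1$ is the discount factor, which determines the relative
significance of earlier versus later reinforcement. For discrete time, the total discounted reinforcement received
during one time step of duration $\Delta t$ when performing action $u_t$ in state $x_t$ is defined as:

$$R_{\Delta t}(x_t,u_t) = \int_t^{t+\Delta t} \gamma^{\tau - t}\, r(x_\tau,u_\tau)\,d\tau \tag{5}$$

The goal for a discrete-time controller is to find actions that maximize the expected total discounted
reinforcement:

$$\left\langle \sum_{i=0}^{\infty} \gamma^{i\Delta t}\, R_{\Delta t}(x_{i\Delta t},u_{i\Delta t}) \right\rangle \tag{6}$$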

This expression is often written with $\Delta t$ not shown and with $\gamma$ chosen to implicitly reflect $\Delta t$, but is written
here with the $\Delta t$ shown explicitly so that expression (6) will reduce to expression (4) in the limit as $\Delta t$ goes to
zero. A policy, $\pi(x)$, is a function that specifies a particular action for the controller to perform in each state $x$.
The optimal policy for a given MDP, $\pi^*(x)$, is a policy such that choosing $u_t = \pi^*(x_t)$ results in maximizing the
total discounted reinforcement for any choice of starting state. If reinforcement is bounded, then at least one
optimal policy is guaranteed to exist. The value of a state, $V^*(x)$, is the expected total discounted reinforcement
received when starting in state $x$ and choosing all actions in accordance with an optimal policy. The functions
stored in a learning system at a given time are represented by variables without superscripts such as $\pi$, $V$, $A$, or
$Q$. The true, optimal functions that are being approximated are represented by $*$ superscripts, such as $\pi^*$, $V^*$,
$A^*$, or $Q^*$.
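
The relationship between the continuous-time and discrete-time performance measures can be checked numerically in a simple special case. The sketch below assumes a constant reinforcement rate $r$ (an assumption made only for this example), for which the integral in expression (5) has a closed form; the function names and the truncation horizon are likewise illustrative choices. In this special case the discrete sum matches the continuous integral for every time step duration, not only in the limit.

```python
import math


def step_reinforcement(r, gamma, dt):
    """R_dt of expression (5) for a constant reinforcement rate r."""
    # Closed form of the integral of gamma**s * r for s from 0 to dt.
    return r * (gamma ** dt - 1.0) / math.log(gamma)


def discrete_return(r, gamma, dt, horizon=500.0):
    """Expression (6), truncated once discounting makes the tail negligible."""
    n_steps = round(horizon / dt)
    R = step_reinforcement(r, gamma, dt)
    return sum(gamma ** (i * dt) * R for i in range(n_steps))


def continuous_return(r, gamma):
    """Expression (4) for constant r: integral of gamma**t * r over [0, inf)."""
    return -r / math.log(gamma)


returns = {dt: discrete_return(1.0, 0.9, dt) for dt in (1.0, 0.1, 0.01)}
limit = continuous_return(1.0, 0.9)
```

Every entry of `returns` agrees with `limit` up to the truncation and rounding error, which is the reason for writing $\gamma^{\Delta t}$ explicitly rather than absorbing the step duration into the discount factor.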

Expression (6) is the most common performance measure for defining a reinforcement learning problem, but
it is not the only conceivable measure. For example, Sutton (1990b) and others have considered a different
performance measure, which Schwartz (1993) calls T-optimality. For this performance measure, the problem is
to find a policy that maximizes the average reinforcement $\rho$, which is defined as:

$$\rho = \lim_{n\to\infty} \frac{\left\langle \sum_{i=0}^{n-1} R_{\Delta t}(x_{i\Delta t},u_{i\Delta t}) \right\rangle}{n} \tag{7}$$

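A policy that maximizes $\rho$ is always defined to be better than a policy that does not maximize $\rho$. If two policies
both maximize $\rho$, then the better policy is defined to be the one with the larger average adjusted value $\sigma$, which
is defined as:

$$\sigma = \lim_{n\to\infty} \frac{\left\langle \sum_{m=0}^{n-1} \sum_{i=0}^{m-1} \left( R_{\Delta t}(x_{i\Delta t},u_{i\Delta t}) - \rho \right) \right\rangle}{n} \tag{8}$$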

A policy is said to be T-optimal if it maximizes $\rho$ and has the largest $\sigma$ of all the policies that maximize $\rho$.
This means that a learning system using this performance measure will first try to maximize the average of all
future reinforcements. If there are several policies that all maximize the average reinforcement, then it will
choose the policy that also maximizes near-term reinforcement. T-optimal policies do not exist for
every MDP. If T-optimal policies do exist for a given MDP, they may all be nonstationary, so that the optimal
action in a given state may not be a deterministic function of the state alone (Ross, 1983). If stationary
T-optimal policies exist, it is not clear how to learn them. A reinforcement learning system is a system that is
capable of solving reinforcement learning problems. One reinforcement learning system for finding T-optimal
policies has been proposed by Schwartz (1993). The algorithm requires that an R value be stored for each
state-action pair, and that a global scalar $\rho$ be stored. R values are represented here by script letters ($\mathcal{R}$) to distinguish
them from reinforcement ($R$). The update rules for R-learning are as follows, where the learning system
performs action $u$ in state $x$, resulting in reinforcement $R$ and a transition to state $x'$:

$$\mathcal{R}(x,u) \xleftarrow{\alpha} R_{\Delta t}(x,u) - \rho + \max_{u'} \mathcal{R}(x',u') \tag{9}$$

If the action performed follows the current policy (i.e., $u = \arg\max_{u'} \mathcal{R}(x,u')$), then:

$$\rho \xleftarrow{\alpha} R_{\Delta t}(x,u) + \max_{u'} \mathcal{R}(x',u') - \max_{u'} \mathcal{R}(x,u') \tag{10}$$

It is not clear whether R-learning will always cause the R values to converge. Even when R-learning does
converge, and a stationary T-optimal policy does exist, it is still possible for R-learning to converge to the
wrong answer. Figure 1 shows one example where R-learning has converged, but the final R values erroneously
indicate that all possible policies are equally good. This MDP has the property that any policy under
consideration has a single average reward independent of the initial state. It might be expected that this
property would ensure that whenever R-learning converges it must arrive at a T-optimal policy, but that is not
the case. R-learning is a recent development, and it is possible that future versions of R-learning will avoid this
difficulty.  The  use  of  undiscounted  performance  measures  in  reinforcement  learning  is  an  important  question 
and  deserves  further  research,  but  due  to  the  current  difficulties  with  using  the  T-optimality  performance 
measure,  it  will  not  be  considered  further  here.  The  following  discussions  and  results  all  pertain  to  the  problem 
of  maximizing  the  standard  performance  measure  (expected  discounted  reinforcement)  given  in  expression  (6). 
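
For readers who want to experiment with the R-learning rules, updates (9) and (10) can be sketched for a lookup table as follows. This is a minimal sketch under stated assumptions: the dictionary representation, the function name `r_learning_step`, the use of one learning rate for both updates, and the choice to evaluate both maximizations before either parameter is changed are all illustrative decisions. The one-state MDP at the bottom is hypothetical and is not the MDP of Figure 1.

```python
def r_learning_step(R, rho, x, u, reinforcement, x_next, actions, alpha):
    """Apply update (9), then update (10) if u was greedy in x. Returns rho."""
    best_next = max(R[(x_next, a)] for a in actions)
    best_here = max(R[(x, a)] for a in actions)
    was_greedy = R[(x, u)] == best_here
    # Update (9): move R(x, u) toward reinforcement - rho + max R(x', .).
    R[(x, u)] = (1 - alpha) * R[(x, u)] + alpha * (reinforcement - rho + best_next)
    if was_greedy:
        # Update (10): move rho toward reinforcement + max R(x', .) - max R(x, .).
        rho = (1 - alpha) * rho + alpha * (reinforcement + best_next - best_here)
    return rho


# Hypothetical one-state, one-action MDP: a self-loop paying reinforcement 1.
actions = ["a"]
R = {("s", "a"): 0.0}
rho = 0.0
for _ in range(200):
    rho = r_learning_step(R, rho, "s", "a", 1.0, "s", actions, alpha=0.1)
```

In this trivial case the average reinforcement is 1 and `rho` approaches that value. The example only exercises the update rules; it says nothing about the convergence difficulties discussed above.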

[Figure 1, panels (a) and (b): state diagrams not reproduced; $\rho = 0$]

Figure 1. Counterexample for R-learning


A deterministic MDP with two states and two actions is given for which R-learning
converges, but does not learn to distinguish the T-optimal policy from nonoptimal
policies. T-optimality is the undiscounted performance measure defined by Schwartz
(1993). Diagram (a) shows the names of the states (1 and 2) and actions (A and B). It
also shows the immediate reinforcement received when performing each action in each
state (-10, 0, and 10). Diagram (b) shows the initial R values before learning starts.
Initially, $\rho = 0$, which is correct because all possible policies yield an average
reinforcement of zero when starting in any state. This MDP has the property that any
policy under consideration has a single average reward independent of the initial state. It
might be expected that this property would ensure that whenever R-learning converges it
must arrive at a T-optimal policy, but that is not the case. The T-optimal policy is to
choose action A in both states. The worst possible policy is to choose action B in both
states. The initial R values erroneously indicate that all possible policies are equally
good. Repeated applications of the R-learning update rules result in no changes to $\rho$ or to
any of the R values. Therefore, R-learning will never discover that the policy of always
choosing A is better than the policy of always choosing B.

One of the earliest methods for finding policies that maximize expression (6) is the algorithm known as
value iteration (or simply called the dynamic programming algorithm by Bertsekas, 1987). Value iteration is
an algorithm for finding the optimal policy $\pi^*$, given the transition probabilities $P$ and the reinforcement
function $R$. Value iteration stores a value $V(x)$ for each state $x$. The values are initialized to arbitrary numbers,
and then are updated repeatedly according to the update rule:

$$V(x) \leftarrow \max_u \left[ R_{\Delta t}(x,u) + \gamma^{\Delta t} \sum_{x'} P(u,x,x')\, V(x') \right] \tag{11}$$


If this procedure is performed infinitely often in every state, then each value $V(x)$ is guaranteed to converge
to the optimal value $V^*(x)$. For a given MDP, the function $V^*(x)$ is the unique solution to the Bellman equation:

$$V(x) = \max_u \left[ R_{\Delta t}(x,u) + \gamma^{\Delta t} \sum_{x'} P(u,x,x')\, V(x') \right] \tag{12}$$

After convergence, the optimal policy is implied by the value function, and can be found quickly:

$$\pi^*(x) = \arg\max_u \sum_{x'} P(u,x,x') \left[ R_{\Delta t}(x,u) + \gamma^{\Delta t} V^*(x') \right] \tag{13}$$
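
The value iteration procedure of update (11) and equation (13) can be sketched for a small finite MDP as shown below. The sketch sweeps over all states synchronously, which is one simple way of updating every state repeatedly. The array layout, the function name `value_iteration`, and the two-state MDP are assumptions made for illustration and are not taken from the report.

```python
import numpy as np


def value_iteration(P, R, gamma, dt=1.0, sweeps=1000):
    """P[u, x, x2] = transition probability, R[x, u] = one-step reinforcement."""
    disc = gamma ** dt
    V = np.zeros(P.shape[1])
    for _ in range(sweeps):
        # Bracketed term of update (11) for every (x, u) pair at once.
        backup = R + disc * np.einsum("uxy,y->xu", P, V)
        V = backup.max(axis=1)
    # Equation (13): greedy action with respect to the converged values.
    policy = (R + disc * np.einsum("uxy,y->xu", P, V)).argmax(axis=1)
    return V, policy


# Hypothetical deterministic MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
# Staying in state 1 pays 1; every other state-action pair pays 0.
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
V, policy = value_iteration(P, R, gamma=0.9)
```

For this example the fixed point of equation (12) can be worked out by hand: staying in state 1 forever is worth $1/(1-0.9) = 10$, and state 0 is worth $0.9 \times 10 = 9$ by switching first, so the implied policy is to switch in state 0 and stay in state 1.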

Simple value iteration is not well suited, however, for reinforcement learning in general. First, it requires
that the probabilities and reinforcement function be known. If they are not known, then a separate learning
procedure must estimate them. Second, if there are $n$ possible states and $m$ possible actions, the algorithm requires
$O(nm)$ calculations to perform a single update. If there is a continuum of possible states and actions, then the
summation becomes an integral, and the maximization is performed over an infinite set of integrals. If each
state and action is a high-dimensional vector, then an approximation to update (11) will typically require $O(nm)$
calculations, where $n$ and $m$ each scale exponentially with the dimension.

The  scaling  problems  of  value  iteration  can  be  addressed  by  more  incremental  algorithms  that  require  fewer 
calculations  per  update.  Such  algorithms  typically  store  more  information  than  just  the  V(x)  that  is  stored  during 
learning  for  value  iteration.  For  example,  an  algorithm  might  store  both  an  estimate  of  the  optimal  value  of 
each state, $V(x)$, and an estimate of the optimal action for each state, $\pi(x)$. There are various incremental,
asynchronous algorithms for learning with such a system. Unfortunately, these typically require that $\pi(x)$
change instantaneously during an update, which may not be possible if $\pi(x)$ is stored in a general function
approximation system such as a neural network. Function approximation systems typically change gradually
rather than instantaneously. Also, these algorithms are not guaranteed to converge, even when there are only a
finite number of states and actions (Williams and Baird, 1990, 1993).

For systems with a continuum of states and actions, differential dynamic programming (Jacobson and
Mayne, 1970) avoids the need to instantaneously change the policy. This algorithm is typically used to find an
optimal trajectory from a single starting state to a single final state. A policy is found (by some other means)
that leads from the start state to the final state. The value function is then calculated for the states along the
trajectory. The update rule for differential dynamic programming then causes incremental changes in the value
and policy function so that the trajectory is slowly changed to increase the total reinforcement. This algorithm
is similar to the backpropagation through time algorithm (Nguyen and Widrow, 1990), which first learns a
model  of  the  system  being  controlled,  then  improves  the  policy  through  gradient  descent.  Unfortunately,  both 
of  these  algorithms  are  susceptible  to  local  optima;  the  final  policy  will  be  such  that  it  cannot  be  improved  by  an 
infinitesimal  change,  but  there  may  be  an  entirely  different  policy  that  is  much  better. 

Instead of storing a value and a policy, a learning system could store a value $V(x)$ for each state and a
change in value $\Delta V(x,u)$ for each state-action pair. The change in value $\Delta V(x,u)$ would represent the expected
difference between the value of state $x$ and the value of the state reached by performing action $u$ in state $x$. This
allows a more incremental algorithm than value iteration, because it is possible to avoid summing over all
possible outcomes for a given action in a given state, and it is no longer necessary to know a model of the MDP.
After performing action $u$ in state $x$ causing a transition to state $x'$, the change in value could be updated
according to update (14):

$$\Delta V(x,u) \leftarrow R_{\Delta t}(x,u) + \gamma^{\Delta t} V(x') - V(x) \tag{14}$$

After performing update (14), if $\Delta V(x,u)$ is the maximum $\Delta V$ in state $x$ (i.e., action $u$ is the current policy), then
update (15) should also be performed:

$$V(x) \leftarrow R_{\Delta t}(x,u) + \gamma^{\Delta t} V(x') \tag{15}$$

The  idea  of  storing  both  the  value  function  and  the  change  in  value  (or  rate  of  change  of  value)  was  found  to  be 
useful  in  one  application  by  White  and  Sofge  (1990,  1992),  who  incorporated  this  idea  into  a  larger  system  that 
also  included  a  stored  policy.  However,  the  obvious  algorithm  for  updating  such  stored  functions,  updates  (14) 
and  (15),  is  not  guaranteed  to  converge,  even  for  a  simple,  deterministic  MDP  with  only  eight  states,  two 
actions, and time step $\Delta t = 1$. Figure 2 shows an example for which this algorithm does not converge. For this
MDP, the optimal policy is action A in every state. The worst policy is B whenever possible. The initial $\Delta V$
function implies that the worst policy is considered to be optimal by the learning system. Thus, not only does
the learning system fail to converge to the optimal policy, it also periodically implies the worst possible policy.
Also, the parameter values shown in Figure 2 constitute an attractor: if the initial parameter values are changed
slightly, then after each sequence of updates they will move toward the parameter values in Figure 2. Updates
(14) and (15) change the parameters instantaneously. If there were an $\alpha$ above the arrow, to represent a more
gradual change, then the modified algorithm would also fail to converge for this counterexample. Each gradual
update shown in the sequence in Figure 2 would simply be repeated several times. An instantaneous update can
always be approximated by repeating a gradual update. There are other modifications that could be imagined
for updates (14) and (15), but it is not apparent how to modify them to ensure convergence to optimality. It is
not clear how a reinforcement learning system could be built that stores only $V$ and $\Delta V$ (or that stores $V$, $\Delta V$, and
a policy) and that is guaranteed to converge to the optimal policy.
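
To make the mechanics of updates (14) and (15) concrete, the following sketch implements them for a lookup table. The dictionary layout, the function name `change_in_value_step`, and the treatment of ties as "maximum" are assumptions of this example. The one-state MDP used to exercise the code is hypothetical and deliberately benign: it is not the eight-state counterexample of Figure 2, whose transition structure is given only in the figure.

```python
def change_in_value_step(V, dV, x, u, reinforcement, x_next, actions, gamma, dt=1.0):
    """Apply update (14), then update (15) if u now has the largest dV in x."""
    disc = gamma ** dt
    # Update (14): instantaneous change of the stored change in value.
    dV[(x, u)] = reinforcement + disc * V[x_next] - V[x]
    if dV[(x, u)] >= max(dV[(x, a)] for a in actions):
        # Update (15): u is the current policy in x, so refresh V(x) as well.
        V[x] = reinforcement + disc * V[x_next]


# Hypothetical one-state MDP: both actions loop back; A pays 2 and B pays 1.
actions = ["A", "B"]
payoff = {"A": 2.0, "B": 1.0}
V = {"s": 0.0}
dV = {("s", a): 0.0 for a in actions}
for _ in range(60):
    for u in actions:
        change_in_value_step(V, dV, "s", u, payoff[u], "s", actions, gamma=0.5)
```

In this easy case the stored value settles at $2/(1-0.5) = 4$ and action A keeps the larger change in value. That outcome depends on the particular MDP and update order chosen here, so it should not be read as evidence that the updates converge in general.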


Figure  2.  Counterexample  for  change  in  value 

A deterministic MDP is given with eight states, two actions, and a time step $\Delta t = 1$ for
which the updates (14) and (15) are not guaranteed to converge. The name of each state
and action is shown in (a). Action A yields an immediate reinforcement of 2, and action
B yields an immediate reinforcement of 1. The initial value in each state, $V(x)$, and initial
change of value for each state-action pair, $\Delta V(x,u)$, are shown in (b). The parameters fail
to converge when the sequence of updates:

{2B,  1A,  2A,  8A,  8B,  1A,  6B, 

4B,  3A,  4A,  2A,  2B,  3A,  8B, 

6B,  5A,  6A,  4A,  4B,  5A,  2B, 

8B,  7A,  8A,  6A,  6B,  7A,  4B} 

is  repeated  infinitely  often,  where  the  numbers  are  states  and  the  letters  are  actions.  After 
performing  the  first  row  of  updates,  all  parameters  have  shifted  clockwise  two  states. 
After  performing  all  four  rows  of  updates,  all  parameters  have  been  updated  at  least  once, 
and  all  parameters  have  returned  to  their  initial  values.  The  optimal  policy  is  to  choose 
action  A  in  every  state.  The  worst  policy  is  to  choose  action  B  whenever  possible.  The 
initial parameters cause the learning system to classify the worst policy as being optimal,
and this is still the case after the above sequence has been repeated arbitrarily often.
The  worst  policy  continues  to  be  classified  as  optimal,  even  if  the  initial  parameters  are 
perturbed  slightly. 


Another  approach  is  to  store  a  probability  of  choosing  each  action  in  each  state,  rather  than  a  single  policy 
action  for  each  state.  This  approach  has  been  used,  for  example,  by  Gullapalli  (1990).  This  approach  requires 
that  the  controller  choose  actions  according  to  the  stored  probabilities  during  learning.  The  probabilities 
typically  converge  to  a  deterministic  policy,  so  exploration  by  the  learning  system  must  decrease  over  time. 
This  prevents  the  issue  of  exploration  from  being  addressed  separately  from  the  issue  of  learning.  It  would  be 
useful to have a general algorithm that was guaranteed to learn when observing any sequence of actions, not just
actions  chosen  according  to  specific  probabilities.  For  such  an  algorithm,  the  exploration  mechanism  could  be 
designed  freely,  without  concern  that  it  might  prevent  convergence  of  the  learning  algorithm. 

Q-learning is an algorithm that avoids the problems of the above algorithms. It is incremental, direct (does
not need a model of the MDP), and guaranteed to converge, at least for the discrete case with a finite number of
states and actions. Furthermore, it can learn from any sequence of experiences in which every action is tried in
every state infinitely often. Instead of storing values and policies, Q-learning stores Q values. For a given state
$x$ and action $u$, the optimal Q value, $Q^*(x,u)$, is the expected total discounted reinforcement that is received by
starting in state $x$, performing action $u$ on the first time step, then performing optimal actions thereafter. The
maximum Q value in a state is the value of that state. The action associated with the maximum Q value in a
state is the policy for that state. Initially, all Q values are set to arbitrary numbers. After an action $u$ is
performed in state $x$, the result is observed and the Q value is updated:

$$Q(x,u) \xleftarrow{\alpha} R_{\Delta t}(x,u) + \gamma^{\Delta t} \max_{u'} Q(x',u') \tag{16}$$

The equivalent of the Bellman equation for Q-learning is:

$$Q(x,u) = R_{\Delta t}(x,u) + \gamma^{\Delta t} \sum_{x'} P(u,x,x') \max_{u'} Q(x',u') \tag{17}$$

The  optimal  Q  function,  Q*(x,u),  is  the  unique  solution  to  equation  (17).  The  policies  implied  by  Q*, 
policies  that  always  choose  actions  that  maximize  Q*,  are  optimal  policies. 
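
Update (16) can be sketched for a lookup table as follows. The sketch assumes uniformly random action selection as the exploration mechanism, a fixed learning rate, and a hypothetical two-state deterministic MDP (`toy_step`); none of these choices come from the report, and the function names are illustrative.

```python
import random


def q_learning(step, n_states, n_actions, gamma, alpha, n_steps, dt=1.0, seed=0):
    """Tabular Q-learning with uniformly random exploration."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    disc = gamma ** dt
    x = 0
    for _ in range(n_steps):
        u = rng.randrange(n_actions)
        reinforcement, x_next = step(x, u)
        # Update (16): move Q(x, u) toward the one-step backed-up target.
        target = reinforcement + disc * max(Q[x_next])
        Q[x][u] = (1 - alpha) * Q[x][u] + alpha * target
        x = x_next
    return Q


def toy_step(x, u):
    """Hypothetical MDP: action 0 stays, action 1 switches; staying in 1 pays 1."""
    x_next = x if u == 0 else 1 - x
    reinforcement = 1.0 if (x == 1 and u == 0) else 0.0
    return reinforcement, x_next


Q = q_learning(toy_step, 2, 2, gamma=0.9, alpha=0.5, n_steps=20000)
greedy = [row.index(max(row)) for row in Q]
```

Note that the update uses only the observed reinforcement and successor state, with no transition probabilities and no sum over outcomes. For this MDP the solution of equation (17) has maximum Q values of 9 in state 0 and 10 in state 1, and the learned table approaches those values.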

Update  (16)  does  not  require  a  model  of  the  MDP,  nor  does  it  contain  any  summations  or  integrals.  The 
computational  complexity  of  a  single  update  is  independent  of  the  number  of  states.  If  the  Q  values  are  stored 
in  a  lookup  table,  then  the  complexity  is  linear  in  the  number  of  actions,  due  to  the  time  that  it  takes  to  find  the 
maximum.  However,  the  term  being  maximized  is  a  stored  function,  not  a  calculated  expression.  This  suggests 
that if $Q$ is stored in an appropriate function approximation system, it might be possible to reduce even this part
of the update to a constant-time algorithm. One algorithm that does this is described in Baird (1992). Another
method, wire fitting, is described in Baird and Klopf (1993b). In both cases, the maximization of the function is
performed incrementally during learning, rather than requiring an exhaustive search for each update. Q-learning
therefore appears to have none of the disadvantages of any of the algorithms described above, and the
computational complexity per update is constant. Reinforcement learning systems based on discrete Q-learning
are described in Baird and Klopf (1993a), and Klopf, Morgan, and Weaver (1993).

Q-learning requires relatively little computation per update, but it is useful to consider how the number of
updates required scales with noise or with the duration of a time step, $\Delta t$. An important consideration is the
relationship between Q values for the same state, and between Q values for the same action. The Q values
$Q(x,u_1)$ and $Q(x,u_2)$ represent the long-term reinforcement received when starting in state x and performing
action $u_1$ or $u_2$ respectively, followed by optimal actions thereafter. In a typical reinforcement learning problem
with continuous states and actions, it is frequently the case that performing one wrong action in a long sequence
of optimal actions will have little effect on the total reinforcement. In such a case, $Q(x,u_1)$ and $Q(x,u_2)$ will
have relatively close values. On the other hand, the values of widely separated states will typically not be close
to each other. Therefore $Q(x_1,u)$ and $Q(x_2,u)$ may differ greatly for some choices of $x_1$ and $x_2$. The policy
implied by a Q function is determined by the relative Q values in a single state. If the Q function is stored in a
function approximation system with some error, the implied policy will tend to be sensitive to that error. As the
time step duration $\Delta t$ approaches zero, the penalty for one wrong action in a sequence decreases, the Q values
for different actions in a given state become closer, and the implied policy becomes even more sensitive to noise
or function approximation error. In the limit, for continuous time, the Q function contains no information about
the policy. Therefore, Q-learning would be expected to learn slowly when the time steps are of short duration,
due to the sensitivity to errors, and it is incapable of learning in continuous time. This problem is not a property
of any particular function approximation system; rather, it is inherent in the definition of Q values.


3. THE ADVANTAGE UPDATING ALGORITHM


Reinforcement  learning  in  continuous  time  is  possible  through  the  use  of  advantage  updating.  The 
advantage  updating  algorithm  is  a  reinforcement  learning  algorithm  in  which  two  types  of  information  are 
stored.  For  each  state  x,  the  value  V(x)  is  stored,  representing  the  total  discounted  return  expected  when  starting 
in state x and performing optimal actions. For each state x and action u, the advantage, A(x,u), is stored,
representing the degree to which the expected total discounted reinforcement is increased by performing action
u (followed by optimal actions thereafter) relative to the action currently considered best. After convergence to
optimality, the value function V*(x) represents the true value of each state. The advantage function A*(x,u) will
be zero if u is the optimal action (because u confers no advantage relative to itself) and A*(x,u) will be negative
for any suboptimal u (because a suboptimal action has a negative advantage relative to the best action). For a
given action u, the Q value Q*(x,u) represents the utility of that action, the change in value $\Delta V^*(x,u)$ represents
the incremental utility of that action, and the advantage A*(x,u) represents the utility of that action relative to the
optimal action. The optimal advantage function A* can be defined in terms of the optimal value function V*:

$$A^*(x,u) = \frac{1}{\Delta t}\left[ R_{\Delta t}(x,u) - V^*(x) + \gamma^{\Delta t} \sum_{x'} P(u,x,x') V^*(x') \right] \qquad (18)$$

The definition of an advantage includes a $1/\Delta t$ term to ensure that, for small time step duration $\Delta t$, the
advantages will not all go to zero. Advantages are related to Q values by:

$$A^*(x,u) = \frac{1}{\Delta t}\left[ Q^*(x,u) - \max_{u'} Q^*(x,u') \right] \qquad (19)$$

Both  the  value  function  and  the  advantage  function  are  needed  during  learning,  but  after  convergence  to 
optimality,  the  policy  can  be  extracted  from  the  advantage  function  alone.  The  optimal  policy  for  state  x  is  any 
u that maximizes A*(x,u). The notation $A_{\max}(x)$ is defined as:

$$A_{\max}(x) = \max_{u} A(x,u) \qquad (20)$$

If $A_{\max}$ is zero in every state, then the advantage function is said to be normalized. $A_{\max}$ should eventually
converge to zero in every state. The update rules for advantage updating in discrete time are as follows:
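
LEARN: perform action $u_t$ in state $x_t$

$$A(x_t,u_t) \xleftarrow{\alpha} A_{\max}(x_t) + \frac{R_{\Delta t}(x_t,u_t) + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t)}{\Delta t} \qquad (21)$$

$$V(x_t) \xleftarrow{\beta} V(x_t) + \frac{A_{\max}^{new}(x_t) - A_{\max}^{old}(x_t)}{\alpha} \qquad (22)$$

NORMALIZE: pick an arbitrary state x and pick an action u randomly with uniform probability

$$A(x,u) \xleftarrow{\omega} A(x,u) - A_{\max}(x) \qquad (23)$$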

For the learning updates, the system performs action $u_t$ in state $x_t$ and observes the reinforcement received,
$R_{\Delta t}(x_t,u_t)$, and the next state, $x_{t+\Delta t}$. The advantage and value functions are then updated according to updates
(21) and (22). Update (21) modifies the advantage function A(x,u). The maximum advantage in state x prior to
applying update (21) is $A_{\max}^{old}(x)$. After applying update (21) the maximum is $A_{\max}^{new}(x)$. If these are
different, then update (22) changes the value V(x) by a proportional amount. As $\alpha$ goes to zero, the change in
$A_{\max}$ goes to zero, but the change in $A_{\max}$ in update (22) is divided by $\alpha$, so the value function will continue to
learn at a reasonable rate as $\alpha$ decreases. Advantage updating can be applied to continuous-time systems by
taking the limit as $\Delta t$ goes to zero in updates (21), (22), and (23). For (22) and (23), $\Delta t$ can be replaced with
zero. Substituting equation (5) into update (21) and taking the limit as $\Delta t$ goes to zero yields:

$$A(x_t,u_t) \xleftarrow{\alpha} A_{\max}(x_t) + V(x_t)\ln\gamma + \dot{V}(x_t) + r(x_t,u_t) \qquad (24)$$
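
The limit can be checked term by term. The following is a sketch, assuming that V is differentiable along the trajectory and that, as in the LQR example of Section 4, $R_{\Delta t}$ is the discounted integral of the reinforcement rate r over one time step. As $\Delta t$ goes to zero:

$$\frac{R_{\Delta t}(x_t,u_t)}{\Delta t} \to r(x_t,u_t)$$

$$\frac{\gamma^{\Delta t}V(x_{t+\Delta t}) - V(x_t)}{\Delta t} = \gamma^{\Delta t}\,\frac{V(x_{t+\Delta t}) - V(x_t)}{\Delta t} + \frac{\gamma^{\Delta t}-1}{\Delta t}\,V(x_t) \to \dot{V}(x_t) + V(x_t)\ln\gamma$$

Adding $A_{\max}(x_t)$ to the sum of these two limits gives the target of the continuous-time update.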

The  learning  updates,  (21),  (22),  and  (24),  require  interaction  with  the  MDP  or  a  model  of  the  MDP,  but  the 
normalizing  update,  (23),  does  not.  Normalizing  updates  can  always  be  performed  by  evaluating  and  changing 
the stored functions independent of the MDP. Normalization is done to ensure that after convergence
$A_{\max}(x)=0$ in every state. This avoids the representation problem noted above for Q-learning, where the Q
function differs greatly between states but differs little between actions in the same state. Learning and
normalizing can be performed asynchronously. For example, a system might perform a learning update once
per time step, in states along a particular trajectory through state space, and perform a normalizing update
multiple times per time step in states scattered randomly throughout the state space. The advantage updating
algorithm is referred to as “advantage updating” rather than “advantage learning” because it includes both
learning and normalizing updates.
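
The discrete-time updates can be sketched for lookup tables as follows. This is an illustrative sketch under stated assumptions, not the code used for the simulations in this report: the function names are hypothetical, and each arrow is assumed to mean "move the stored value a fraction (alpha, beta, or omega) of the way toward the target".

```python
import numpy as np

def learn_update(V, A, x, u, reward, x_next, alpha, beta, gamma, dt):
    """Updates (21) and (22) for one observed transition, applied in place.

    V has shape (n_states,), A has shape (n_states, n_actions).
    Assumed arrow semantics: y <-rate- target means y = (1 - rate) * y + rate * target.
    """
    a_max_old = A[x].max()
    # Update (21): target is A_max plus the scaled one-step change in value.
    target_a = a_max_old + (reward + gamma ** dt * V[x_next] - V[x]) / dt
    A[x, u] = (1.0 - alpha) * A[x, u] + alpha * target_a
    a_max_new = A[x].max()
    # Update (22): shift V(x) in proportion to the change in A_max.
    target_v = V[x] + (a_max_new - a_max_old) / alpha
    V[x] = (1.0 - beta) * V[x] + beta * target_v

def normalize_update(A, x, u, omega):
    """Update (23): pull A(x,u) toward A(x,u) - A_max(x); no interaction with the MDP."""
    target = A[x, u] - A[x].max()
    A[x, u] = (1.0 - omega) * A[x, u] + omega * target
```

Only stored values are read and written, so the learning and normalizing calls can be interleaved in any order, in keeping with the asynchronous operation described above.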


The  equivalent  of  the  Bellman  equation  for  advantage  updating  is  a  pair  of  simultaneous  equations: 

$$V(x) + A(x,u)\Delta t = R_{\Delta t}(x,u) + \gamma^{\Delta t} \sum_{x'} P(u,x,x') V(x') \qquad (25)$$

$$\max_{u} A(x,u) = 0 \qquad (26)$$

The unique solution to this set of equations is the optimal value and advantage functions V*(x) and A*(x,u).
This can be seen by considering an arbitrary state x and the action $u_{\max}$ that maximizes the advantage in that
state. For a given state, if (25) is satisfied, then the action that maximizes A will also maximize the right side of
(25). If the advantage function satisfies (26), then $A(x,u_{\max})=0$. Equation (25) then reduces to equation (12),
which is the Bellman equation. The only solution to this equation is V=V*, so V* is the unique solution to
equations (25) and (26). Given that V=V*, equation (25) can be solved for A, yielding equation (18), so the
unique solution to the set of equations (25) and (26) is the pair of functions A* and V*.
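
The reduction to the Bellman equation can be written out in one line. This is a sketch that assumes equation (12) has the usual form, a maximum over actions of the one-step reinforcement plus the discounted value of the next state. At the maximizing action the advantage term vanishes, and because that action also maximizes the right side:

$$V(x) = R_{\Delta t}(x,u_{\max}) + \gamma^{\Delta t}\sum_{x'}P(u_{\max},x,x')V(x') = \max_{u}\left[R_{\Delta t}(x,u) + \gamma^{\Delta t}\sum_{x'}P(u,x,x')V(x')\right]$$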


The pair of equations (25) and (26) has the same unique solution as the pair (27) and (28), because equation
(28) ensures that $A_{\max}(x)$ is zero in every state.

$$V(x) + \left(A(x,u) - A_{\max}(x)\right)\Delta t = R_{\Delta t}(x,u) + \gamma^{\Delta t} \sum_{x'} P(u,x,x') V(x') \qquad (27)$$

$$\max_{u} A(x,u) = 0 \qquad (28)$$

If,  in  state  x,  a  large  constant  were  added  to  each  advantage  A*(x,u)  and  to  the  value  V*(x),  then  the  resulting 
advantage  and  value  functions  would  still  satisfy  equation  (27).  However,  the  advantage  function  would  not 
satisfy  equation  (28),  and  so  would  be  referred  to  as  an  unnormalized  advantage  function.  Such  a  function 
would  still  be  useful,  because  the  optimal  policy  can  be  calculated  from  it,  but  it  could  be  difficult  to  represent 
in a function approximation system. The learning updates (21) and (22) find value and advantage functions that
satisfy (27). The normalizing update (23) ensures that the advantage function will be normalized, and
so will satisfy (28) as well.

The update rules for advantage updating have a significant property: there are time derivatives in the update
rules, but no gradients or partial derivatives. At time t, it is necessary to know $A_{\max}(x_t)$ and the value and rate
of  change  of  V(xt)  while  performing  the  current  action.  It  is  not  necessary  to  know  the  partial  derivative  of  V  or 
A  with  respect  to  state  or  action.  Nor  is  it  necessary  to  know  the  partial  derivative  of  next  state  with  respect  to 
current  state  or  action.  Only  a  few  of  the  recent  values  of  V  need  to  be  known  in  order  to  calculate  the  time 
derivative;  there  is  no  need  for  models  of  the  system  being  controlled.  Existing  methods  for  solving 
continuous-time  optimization  problems,  such  as  value  iteration  or  differential  dynamic  programming,  require 
that  models  be  known  or  learned,  and  that  partial  derivatives  of  models  be  calculated.  For  a  stochastic  system 
controlled  by  a  continuum  of  actions,  previous  methods  also  require  maximizing  over  a  set  of  one  integral  for 
each  action.  Advantage  updating  does  not  require  the  calculation  of  an  integral  during  each  update  operation, 
and  maximization  is  only  done  over  stored  values.  For  this  reason,  advantage  updating  appears  useful  for 
controlling  stochastic  systems,  even  if  a  model  is  already  known  with  perfect  accuracy.  If  the  model  is  known, 
then  the  system  can  learn  by  interacting  with  the  model  as  in  the  Dyna  system  (Sutton,  1990a).  Table  1 
compares  advantage  updating  with  several  other  algorithms. 
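
Table 1. Comparison of several algorithms

| Algorithm | Information stored for state x, action u | Update rules | Bellman equation | Direct | Converges to $\pi^*$ | Cont. time |
|---|---|---|---|---|---|---|
| R-learning | $\mathcal{R}(x,u)$, $\rho$ | $\mathcal{R} \xleftarrow{\alpha} R - \rho + \max \mathcal{R}'$; if following the policy then: $\rho \xleftarrow{\beta} R + \max \mathcal{R}' - \max \mathcal{R}$ | $\rho = E[R + \max \mathcal{R}'] - \mathcal{R}$ | yes | no | no |
| Value iteration | $V(x)$ | $V \xleftarrow{\alpha} R + \gamma^{\Delta t} \max V'$ | $V = E[R + \gamma^{\Delta t} \max V']$ | no | yes | yes |
| Change in value | $V(x)$, $\Delta V(x,u)$ | $\Delta V \xleftarrow{\alpha} (R + \gamma^{\Delta t} V' - V)/\Delta t$; if following the policy then: $V \xleftarrow{\beta} R + \gamma^{\Delta t} V'$ | $\Delta V = E[R + \gamma^{\Delta t} V' - V]/\Delta t$; $V = E[R + \gamma^{\Delta t} \max V']$ | yes | no | yes |
| Q-learning | $Q(x,u)$ | $Q \xleftarrow{\alpha} R + \gamma^{\Delta t} \max Q'$ | $Q = E[R + \gamma^{\Delta t} \max Q']$ | yes | yes | no |
| Advantage updating | $V(x)$, $A(x,u)$ | $A \xleftarrow{\alpha} A_{\max} + (R + \gamma^{\Delta t} V' - V)/\Delta t$; $V \xleftarrow{\beta} V + \Delta A_{\max}/\alpha$; for a randomly, uniformly chosen action: $A \xleftarrow{\omega} A - A_{\max}$ | $V = E[R + \gamma^{\Delta t} V'] - A\Delta t$; $A_{\max} = 0$ | yes | yes | yes |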

Equations are given in a simplified form, where primed letters represent information
associated with the next state and unprimed letters represent information associated with
the current state. See the text for a more detailed form of the equations. The fourth
column gives the equivalent of the Bellman equation; the unique solution to this equation
or set of equations is the optimal function or functions that should be learned. R-learning
is not guaranteed to learn to reject suboptimal policies. Value iteration is not direct; it
requires a model to be known or learned, and it requires the calculation of the maximum
of an infinite set of integrals to perform one update. The algorithms described in the text
that are based on storing a change in value are not guaranteed to converge, even for a
deterministic MDP with only eight states. Q-learning and R-learning do not work in
continuous time, and are sensitive to function-approximation errors when the time step is
small. Advantage updating is direct, is guaranteed to converge for an MDP with finite
states and actions, and is appropriate for continuous-time systems or systems with small
time steps.

4. A LINEAR QUADRATIC REGULATOR PROBLEM


Linear Quadratic Regulator (LQR) problems are commonly used as test beds for control systems, and are
useful benchmarks for reinforcement learning systems (Bradtke, 1993). The following LQR
control problem can serve as a benchmark for comparing Q-learning to advantage updating in
the presence of noise or small time steps. At a given time t, the state of the system being controlled is the real
value $x_t$. The controller chooses a control action $u_t$ which is also a real value. The dynamics of the system are:

$$\dot{x}_t = u_t \qquad (29)$$

The rate of reinforcement to the learning system, $r(x_t,u_t)$, is

$$r(x_t,u_t) = -x_t^2 - u_t^2 \qquad (30)$$

Given some positive discount factor $\gamma \le 1$, the goal is to maximize the total discounted reinforcement:

$$\int_0^\infty \gamma^t\, r(x_t,u_t)\,dt \qquad (31)$$

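A discrete-time controller can change its output every $\Delta t$ seconds, and its output is constant between
changes. The discounted reinforcement received during a single time step is:

$$R_{\Delta t}(x_t,u_t) = \int_t^{t+\Delta t} \gamma^{\tau-t}\, r(x_\tau,u_\tau)\,d\tau = \int_t^{t+\Delta t} \gamma^{\tau-t} \left(-(x_t + (\tau-t)u_t)^2 - u_t^2\right) d\tau \qquad (32)$$

and the total reinforcement to be maximized is:

$$\sum_{i=0}^{\infty} \gamma^{i\Delta t} R_{\Delta t}(x_{i\Delta t},u_{i\Delta t}) \qquad (33)$$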

Given this control problem, it is possible to calculate the optimal policy $\pi^*(x)$, value function V*(x), Q value
function Q*(x,u), and advantage function A*(x,u). These functions are linear or quadratic for all $\Delta t$ and $\gamma \le 1$:

$$\pi^*(x) = -k_1 x \qquad (34)$$

$$V^*(x) = -k_2 x^2 \qquad (35)$$

$$Q^*(x,u) = -(k_2 + \Delta t\,k_1^2 k_3)x^2 - 2\Delta t\,k_1 k_3\,xu - \Delta t\,k_3 u^2 \qquad (36)$$

$$A^*(x,u) = -k_3(k_1 x + u)^2 \qquad (37)$$

The constants $k_i$ are positive for all nonnegative values of $\Delta t$ and $\gamma \le 1$. For $\Delta t=0$ and $\gamma=1$, all $k_i=1$. Appendix B
gives the general formula for each $k_i$ as a function of $\Delta t$ and $\gamma$.
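
The constants can also be obtained numerically, without the closed forms of Appendix B. The sketch below is a derivation made for illustration, not the formulas from Appendix B. It writes the one-step reinforcement (32) as $-(a x^2 + 2b\,xu + c\,u^2)$, substitutes $V(x) = -k_2 x^2$ and $x' = x + u\Delta t$ into the Bellman equation, and solves the resulting quadratic for $k_2$. The function name `lqr_constants` and the parameter `n_nodes` are hypothetical.

```python
import numpy as np

def lqr_constants(dt, gamma, n_nodes=20):
    """Return (k1, k2, k3) for the LQR benchmark by solving the one-step Bellman equation."""
    # Gauss-Legendre nodes mapped from [-1, 1] to elapsed time s in [0, dt].
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    s = 0.5 * dt * (nodes + 1.0)
    w = 0.5 * dt * weights * gamma ** s
    # One-step reinforcement (32): R = -(a x^2 + 2 b x u + c u^2).
    a = np.sum(w)
    b = np.sum(w * s)
    c = np.sum(w * (s ** 2 + 1.0))
    g = gamma ** dt
    # Substituting V(x) = -k2 x^2 and x' = x + u dt into the Bellman equation
    # gives a quadratic in k2; its positive root is the value coefficient.
    qa = g * dt ** 2
    qb = (1.0 - g) * c - a * g * dt ** 2 + 2.0 * b * g * dt
    qc = b ** 2 - a * c
    k2 = (-qb + np.sqrt(qb ** 2 - 4.0 * qa * qc)) / (2.0 * qa)
    # Matching the u^2 and xu coefficients of (36) gives k3 and k1.
    k3 = (c + g * k2 * dt ** 2) / dt
    k1 = (b + g * k2 * dt) / (c + g * k2 * dt ** 2)
    return k1, k2, k3
```

For very small time steps with $\gamma=1$ the three returned constants approach 1, in agreement with the statement above.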

5. Q-LEARNING WITH SMALL TIME STEPS


[Figure: plot of x(t)]

Figure 3. Optimal LQR trajectories

The optimal trajectory for the linear quadratic regulator (LQR) problem, starting at $x_0=1$,
for continuous time (solid line) and for discrete time with time steps of duration 5
(dashed line). In continuous time, the optimal speed is high when x=1, and the speed
decreases  as  x  approaches  zero.  In  discrete  time,  the  optimal  speed  is  lower  initially,  to 
decrease  the  amount  of  overshoot  on  the  first  time  step. 

Figure 3 illustrates the optimal trajectories for $\Delta t=5$ and $\Delta t=0$ (continuous time) with $\gamma=0.9$. At the first
instant, the optimal policy for continuous time is to move at high speed, but the optimal policy for the discrete-time
system requires a lower speed in order to lessen the degree to which it will overshoot during the first time
step. As the time step duration decreases from 5 to 0, the discrete-time trajectory converges to the continuous-time
trajectory. The optimal value function and Q function are also affected by $\Delta t$. Figure 4 shows the value
functions, policy functions, and Q functions for $\Delta t=5$, $\Delta t=1$, and $\Delta t=0.0001$.

[Figure: three-by-three grid of plots. Panels: V, policy, and Q(x,u), each for dt = 5, dt = 1, and dt = 0.0001.]

Figure 4. Optimal LQR functions for Q-learning

The optimal value function V* (top row), policy $\pi^*$ (middle row), and Q function Q*
(bottom row) are shown for the LQR problem. Functions are shown for time steps of
duration 5 (left column), duration 1 (middle column) and duration 0.0001 (right column).
In all cases, $\gamma=0.9$.


As  the  duration  of  the  time  step  approaches  zero,  the  optimal  policy  and  value  functions  change  slightly, 
approaching  a  linear  and  quadratic  function  respectively,  with  coefficients  of  1.0.  The  change  in  the  optimal  Q 
function is more dramatic, however. This is visible in both the equations and the figures. If $\Delta t$ is set to zero in
equation (36), the Q function ceases to be a function of u; it is only a function of x. This effect is also clear in
the  figures.  For  a  time  step  duration  of  5,  it  is  obvious  that  for  each  possible  state  there  is  a  unique  action  that 
yields  the  maximum  Q  value.  This  ridge  of  best  Q  values  indicates  the  optimal  policy.  If  the  time  step  duration 
is  decreased  to  1,  the  Q  function  shifts  so  that  the  optimal  policy  is  somewhat  harder  to  see.  It  is  still  the  case, 
though,  that  the  maximum  Q  value  in  each  state  represents  the  optimal  action  in  that  state.  As  the  time  step 
approaches  zero  duration  (continuous  time),  it  becomes  increasingly  difficult  to  extract  the  policy  from  the  Q 
function.  In  the  last  Q  function  graph  in  Figure  4,  for  each  state,  the  Q  function  is  almost  constant  over  all  the 
actions.  There  is  a  very  small  bump  in  the  Q  function  corresponding  to  the  optimal  action  in  each  state,  but  it  is 
too small to be visible in a graph of the function, and it would be very difficult for a general function
approximation system to learn. Small errors in function approximation can cause large errors in the policy implied by
the Q function. Q-learning is not practical for control when the time step duration is small, and Q-learning is
theoretically impossible in a continuous-time system.

This  difficulty  is  not  specific  to  this  particular  control  problem.  A  Q  value  is  defined  as  the  expected  total 
discounted  return  if  a  given  action  is  performed  for  only  a  single  time  step,  followed  by  optimal  actions 
thereafter.  Unfortunately,  in  a  typical  control  system,  the  total  discounted  reinforcement  over  an  entire 
trajectory  is  rarely  affected  much  by  a  suboptimal  control  action  on  a  single  time  step.  Thus  the  Q  function  will 
be  almost  equal  for  all  the  actions  in  a  given  state,  while  exhibiting  large  differences  between  different  states. 
This is why Q-learning is not well suited to problems with small time steps.
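
The shrinking difference between actions can be made concrete with equations (36) and (37). The sketch below is only illustrative: it sets all $k_i=1$, which the text states is exact only for $\Delta t=0$ and $\gamma=1$, and the function names, the state x=0.5, and the action grid are arbitrary choices made here.

```python
import numpy as np

def q_star(x, u, dt, k1=1.0, k2=1.0, k3=1.0):
    """Equation (36), with the constants k_i supplied by the caller."""
    return -(k2 + dt * k1 ** 2 * k3) * x ** 2 - 2.0 * dt * k1 * k3 * x * u - dt * k3 * u ** 2

def a_star(x, u, k1=1.0, k3=1.0):
    """Equation (37)."""
    return -k3 * (k1 * x + u) ** 2

def spread_over_actions(values):
    """Difference between the best and worst value over the sampled actions."""
    return float(np.max(values) - np.min(values))

actions = np.linspace(-1.0, 1.0, 201)
x = 0.5
for dt in (5.0, 1.0, 1e-4):
    q_spread = spread_over_actions(q_star(x, actions, dt))
    a_spread = spread_over_actions(a_star(x, actions))
    print(f"dt={dt:g}  Q spread={q_spread:.6g}  A spread={a_spread:.6g}")
```

Under these assumptions the spread of Q over the actions is proportional to $\Delta t$, while the spread of A does not depend on $\Delta t$, so a fixed approximation error that is harmless for A can eventually swamp the differences in Q.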

Figure 5. Optimal LQR functions for advantage updating

The optimal advantage function A* for the LQR problem is given for time steps of
duration 5 (left), duration 1 (middle) and duration 0.0001 (right). In all cases, $\gamma=0.9$.

Figure 5 shows the advantage function for the same parameter values used in Figure 4. For large time step
durations, such as $\Delta t=5$, the advantage and Q functions are almost identical except for scale. Both clearly
represent the policy. For smaller time steps, the advantage function continues to clearly represent the policy
and, even for continuous time, the optimal action in each state can be read easily from the graph. This suggests
that the advantage updating algorithm, which is based upon storing values and advantages, might be preferable
to Q-learning.

6. SIMULATION RESULTS


Advantage updating and Q-learning were compared on the LQR problem described in the previous section.
In the simulations, the V function was approximated by the expression $w_1 x^2$, and the A and Q functions were
approximated by $w_2 x^2 + w_3 xu + w_4 u^2$. All weights, $w_i$, were initialized to random values in the range $\pm 10^{-4}$, and
were updated by simple gradient descent. Each Q function was initialized with the same weights as the
corresponding advantage function to ensure a fair comparison. The control action chosen by the learning
system was constrained to lie in the range [-1,1]. When calculating the maximum A or Q value in a given state,
only actions in this range were considered. On each time step, a state was chosen randomly from the interval
[-1,1]. With probability 0.5, an action was also chosen randomly and uniformly from that interval. With
probability 0.5, the learning system chose an action according to its current policy. The advantage updating
system also performed one normalization step on each time step in a state chosen randomly and uniformly from
[-1,1]. A set of 100 Q-learning systems and 100 advantage updating systems were allowed to run in parallel, all
initialized to different random values, and all exploring with different random states and actions. At any given
time, the policy of each system was a linear function. The absolute value of the difference between the constant
in the current policy and the constant in the optimal policy was calculated for each of the 200 learning systems.
For Q-learning and advantage updating, the solution was said to have been learned when the mean absolute
error for the 100 learning systems running in parallel fell below 0.001. Figure 6 shows the number of time steps
required for learning when various amounts of noise were added to the reinforcement signal. Figure 7 shows
the number of time steps required for learning with various time step durations.
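
The linear policy implied by a quadratic A or Q approximator, and the error measure described above, can be sketched as follows. The function names are hypothetical, and the handling of a non-negative coefficient on $u^2$ (raising an error) is an assumption made here; the report does not say how that case was treated.

```python
import numpy as np

def policy_constant(w_xu, w_uu):
    """Gain c of the unconstrained greedy policy u = c * x for w_xx x^2 + w_xu x u + w_uu u^2.

    Setting the derivative with respect to u to zero gives u = -w_xu * x / (2 * w_uu).
    Requires w_uu < 0 so that the quadratic has a maximum in u.
    """
    if w_uu >= 0:
        raise ValueError("no interior maximum: coefficient on u^2 must be negative")
    return -w_xu / (2.0 * w_uu)

def greedy_action(x, w_xu, w_uu, u_min=-1.0, u_max=1.0):
    # Concave quadratic in u: the constrained maximizer is the clipped stationary point.
    return float(np.clip(policy_constant(w_xu, w_uu) * x, u_min, u_max))

def mean_abs_policy_error(constants, optimal_constant):
    """Mean absolute error of the policy constants over a set of learning systems."""
    return float(np.mean(np.abs(np.asarray(constants) - optimal_constant)))
```

Applied to the coefficients of the optimal advantage function (37), `policy_constant` returns $-k_1$, the constant of the optimal policy (34).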

[Figure: number of time steps required for learning plotted against amount of noise]

Figure 6. Time steps required for learning as a function of noise

For a noise level of n, uniform random noise from the range $[-n \times 10^{-4}, n \times 10^{-4}]$ was added to
the reinforcement on each time step. For each noise level, Q-learning (dashed line) used
the learning rate that was optimal to two significant digits. Advantage updating (solid
line) used learning rates with one significant digit, which were not exhaustively
optimized, yet it tends to require less time than Q-learning to learn the correct policy to
three decimal places. For zero noise, advantage updating is only slightly faster. For a
noise level of 13, advantage updating is more than four times faster than Q-learning.


Figure 7. Time steps required for learning as a function of time step duration, $\Delta t$

For each duration, Q-learning (dashed line) used the learning rate that was optimal to two
significant digits. Advantage updating (solid line) used learning rates with one
significant digit, which were not exhaustively optimized. Advantage updating requires an
approximately constant number of time steps to learn the correct policy to three decimal
places, independent of $\Delta t$. For large $\Delta t$, advantage updating is slightly faster than
Q-learning to learn the policy to three decimal places. For small $\Delta t$, advantage updating
learns approximately 5 orders of magnitude more quickly than Q-learning. As $\Delta t$
approaches zero, the training required by Q-learning appears to grow without bound. Due
to the time required for the simulations, the last two data points for Q-learning were found
with averages over 10 systems rather than 100.

Table 2 shows the learning rates used for each of the three functions for both of the algorithms for various
noise levels, and Table 3 shows learning rates for various time step durations. For the simulations described
here, normalization was done once after each learning update, and both types of update used the same learning
rate. Advantage updating could be optimized by changing the number of normalizing updates performed per
learning update, but this was not done here. One learning update and one normalizing update were performed
on each time step. To ensure a fair comparison for the two learning algorithms, the learning rate for Q-learning
was optimized for each simulation. Rates were found by exhaustive search that were optimal to two significant
digits. The rates for advantage updating had only a single significant digit, and were not exhaustively
optimized. The rates used were sufficient to demonstrate that advantage updating learned faster than Q-learning
in every simulation. Advantage updating appears more resistant to noise than Q-learning, with learning times
that are shorter by a factor of up to seven. This may be due to the fact that noise introduces errors into the
stored function, and the policy for advantage updating is less sensitive to errors in the stored functions than for
Q-learning. All of Figure 6, and the leftmost points of Figure 7, represent simulations with large time steps.
When the time step duration is small, the difference between the two algorithms is more dramatic. In Figure 7,
as the time step duration $\Delta t$ approaches zero (continuous time), advantage updating is able to solve the LQR
problem in a constant 216 time steps. Q-learning, however, requires approximately $10/\Delta t$ time steps.
Simulation showed a speed increase for advantage updating by a factor of over 160,000. Smaller time steps
might have resulted in a larger factor, but Q-learning would have learned too slowly for the simulations to be
practical. Even for a fairly large time step of $\Delta t=0.03$, advantage updating learned twice as quickly as
Q-learning. When $\Delta t=0.03$, the optimal policy reduces x by 90% in 81 time steps. This suggests that if a
controller updates its outputs 50 times per second, then advantage updating will learn significantly faster than
Q-learning for operations that require at least 2 seconds (100 time steps) to perform. Further research is
necessary to determine whether this is true for systems other than a simple LQR problem.

Table 2. Learning rate constants and number of time steps required for learning

| Noise | Q-learning $\alpha$ | Adv. updating $\alpha$ | Adv. updating $\beta$ | Adv. updating $\omega$ | Q-learning time steps | Adv. updating time steps |
|---|---|---|---|---|---|---|
| 1 | 1.4 | 1.0 | 0.3 | 0.5 | 272 | 222 |
| 2 | 0.74 | 0.6 | 0.3 | 0.3 | 415 | 286 |
| 3 | 0.44 | 0.5 | 0.3 | 0.3 | 660 | 375 |
| 4 | 0.26 | 0.4 | 0.3 | 0.4 | 1,128 | 445 |
| 5 | 0.17 | 0.3 | 0.2 | 0.3 | 1,688 | 561 |
| 6 | 0.11 | 0.2 | 0.4 | 0.1 | 2,402 | 755 |
| 7 | 0.088 | 0.2 | 0.2 | 0.2 | 3,250 | 765 |
| 8 | 0.073 | 0.1 | 0.1 | 0.07 | 3,441 | 1,335 |
| 9 | 0.054 | 0.1 | 0.09 | 0.05 | 4,668 | 1,578 |
| 10 | 0.050 | 0.1 | 0.1 | 0.06 | 4,880 | 1,761 |
| 11 | 0.046 | 0.08 | 0.06 | 0.06 | 5,506 | 1,761 |
| 12 | 0.030 | 0.06 | 0.1 | 0.1 | 8,725 | 1,832 |
| 13 | 0.028 | 0.06 | 0.1 | 0.1 | 8,863 | 1,845 |
| 14 | 0.022 | 0.06 | 0.1 | 0.1 | 11,642 | 1,850 |
| 15 | 0.018 | 0.06 | 0.1 | 0.1 | 13,131 | 1,890 |
| 16 | 0.018 | 0.06 | 0.1 | 0.1 | 13,183 | 1,902 |

[The first data row, preceding noise level 1, is not recoverable from the source. Legible fragments: 1.4, 0.5.]

Learning rate constants and the number of time steps required to learn are given for the
case of Q-learning and advantage updating, with $\Delta t=0.1$, and varying levels of noise.
There  are  100  identical  learning  systems  learning  in  parallel,  with  different  initial  random 
weights  and  different  random  actions.  The  system  is  defined  to  have  learned  the  policy 
when  the  mean  absolute  value  of  the  error  in  the  policy  constant  for  the  100  systems  is 
less  than  0.001.  The  learning  rates  for  Q  learning  are  optimal  to  two  significant  digits. 
The  learning  rates  for  advantage  updating  have  only  a  single  significant  digit,  and  have 
not  been  completely  optimized. 

Table 3. Optimal learning rate constants and number of time steps required for learning

[Table body not recoverable from the source. Legible fragments: 3E-4, 1.4, 0.9, 0.3, 0.5; 0.5; 1E-8, 0.9, 0.3.]

Optimal learning rate constants, $\alpha$, and number of time steps required for learning, t, are
given for Q-learning and advantage updating, with no noise, and varying time step
durations, $\Delta t$. 100 identical learning systems learn in parallel, with different initial
random weights and different random actions. The system is defined to have learned the
policy when the mean absolute value of the error in the policy constant for the 100
systems is less than 0.001. Marked results represent averages over 10 systems
rather than 100. The learning rates for Q-learning are optimal to two significant digits.
The learning rates for advantage updating have only a single significant digit, and have
not been completely optimized.

7. CONVERGENCE OF ADVANTAGE UPDATING


There are three types of convergence that are desirable for an algorithm such as advantage updating. First,
performing only learning updates should ensure that the policy implied by the advantage function
converges to optimality. Second, performing only normalizing updates should ensure that A(x,u) becomes
normalized, that is, $A_{\max}(x)$ converges to zero in every state. Third, the full advantage updating algorithm
(performing both types of updates) should ensure that V(x), A(x,u), $A_{\max}(x)$ and the policy implied by A(x,u) all
converge to optimality. Theorem 1 and Theorem 2, below, show the first two types of convergence. The third
type of convergence has not yet been proven. It has not been shown to be impossible, but establishing it will
require further analysis.

Theorem 1. A sequence of updates ensures that, with probability one, V(x) converges to V*(x) and the value of
the policy implied by A(x,u) converges to optimality if:

(1) There are a finite number of possible states and actions.

(2) Each state receives an infinite number of learning updates and a finite number
(possibly zero) of normalizing updates.

(3) $\sum_{n=1}^{\infty} \alpha_n(x,u) = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2(x,u)$ is finite, where $\alpha_n(x,u)$ is the learning rate used for
the nth time the learning updates are applied to action u in state x.

(4) $\forall x,u \;\; \exists n_0$ such that $\forall n > n_0$, $\beta_n(x,u) = \alpha_n(x,u)\,\Delta t$.


Proof: 

If the above conditions are satisfied, then at some point in time during learning, β = αΔt and all future
updates are learning updates (no normalizing updates). Define the function Q to be the left side of equation
(27), so Q(x,u) = V(x) + (A(x,u) − Amax(x))Δt. The learning updates in advantage updating change the quantity
Q(x,u) in the same way that Q-learning does when β = αΔt. Therefore, Q will converge to Q* with probability
one, which ensures that the value of the policy implied by the Q function will converge to optimality with
probability one (Watkins, 1989; Watkins and Dayan, 1992). Note that according to this definition of Q, the
maximum Q value in state x always equals V(x). Therefore, if Q converges to Q*, then V must converge to V*.
In a given state, the action that maximizes Q will also be the action that maximizes A. The value of the policy
implied by the Q function converges to optimality with probability one; therefore, the value of the policy implied
by the A function must also converge to optimality. □
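
The two facts the proof relies on (the maximum Q value equals V(x), and Q and A share a maximizing action) can be checked on a small table. The following Python sketch is illustrative only; the numbers, the step size `dt`, and the helper name `q_from_advantages` are assumptions and do not come from the report.

```python
def q_from_advantages(v, advantages, dt):
    """Return Q(x,u) = V(x) + (A(x,u) - Amax(x)) * dt for a single state x."""
    a_max = max(advantages)
    return [v + (a - a_max) * dt for a in advantages]

# Hypothetical, deliberately unnormalized advantages for one state (Amax != 0).
v_x = 2.0
a_x = [0.7, -1.3, 0.2]
q_x = q_from_advantages(v_x, a_x, dt=0.1)

# Index of the maximizing action under Q and under A.
best_q = max(range(len(q_x)), key=q_x.__getitem__)
best_a = max(range(len(a_x)), key=a_x.__getitem__)
```

Because the same quantity Amax(x) is subtracted from every advantage in the state, the ordering of actions is unchanged, which is why the greedy policy under Q and under A coincide even when A is not in normal form.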

Theorem 2. A sequence of updates ensures that Amax(x) converges to zero with probability one in each state
(the advantage function goes into normal form) if:

(1) There are a finite number of possible actions.

(2) Each state receives an infinite number of normalizing updates and a finite number
(possibly zero) of learning updates.

(3) The learning rate α for each state is constant.


Proof: 

Define the stored information after applying all the learning updates as the “initial” parameter values, so the
learning updates can be ignored. Normalizing updates in one state are not affected by other states, so it is
sufficient to consider a single state. First, consider the case where Amax(x) is initially positive. Define S to be
the sum of the positive advantages in state x. Note that a normalizing update cannot change the sign of Amax(x),
and cannot increase any advantage in state x. If a normalizing update is performed in state x on one of the
positive advantages, then either S will be decremented by αAmax(x), or else one of the positive advantages will
become nonpositive and will remain nonpositive after all future updates. If there are n possible actions, then the
latter can happen at most n−1 times. The maximum of a set of positive numbers is greater than or equal to the
average, so Amax(x) ≥ S/n. Therefore, decreasing S by αAmax(x) results in S being decreased by at least αS/n.
State x will always have at least one positive advantage, so each update has a probability of at least 1/n that it
will update a positive advantage. An infinite number of updates will result in an infinite number of updates to
positive advantages (with probability one), which results in S being decremented by at least αS/n an infinite
number of times, which causes S to converge to zero with probability one. The second case is if Amax(x) is
initially negative. In that case, each update has a probability of at least 1/n that Amax(x) will be increased by at
least −αAmax(x). With probability one, this will happen an infinite number of times, ensuring convergence with
probability one. □
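
The behavior described in the proof can be observed in a small simulation. The Python sketch below is a minimal illustration under stated assumptions: a single state, actions chosen uniformly at random, a constant rate `alpha`, and hypothetical starting advantages. The function name `normalize_state` is not from the report.

```python
import random

def normalize_state(advantages, alpha, n_updates, rng):
    """Apply n_updates normalizing updates to one state's advantages.

    Each update picks an action uniformly at random and moves A(x,u)
    toward A(x,u) - Amax(x), that is, A(x,u) <- A(x,u) - alpha * Amax(x).
    """
    a = list(advantages)
    for _ in range(n_updates):
        u = rng.randrange(len(a))
        a[u] -= alpha * max(a)
    return a

rng = random.Random(0)
# Case 1 of the proof: Amax(x) starts positive.
positive_case = normalize_state([3.0, 1.0, -2.0, 0.5], alpha=0.1,
                                n_updates=5000, rng=rng)
# Case 2 of the proof: Amax(x) starts negative.
negative_case = normalize_state([-4.0, -1.0, -2.5], alpha=0.1,
                                n_updates=5000, rng=rng)
```

In both runs the largest advantage approaches zero without changing sign, matching the two cases treated in the proof.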
Theorem  1  indicates  that  the  learning  updates  alone  are  sufficient  to  learn  the  optimal  policy,  when  the
advantage  function  is  stored  in  a  lookup  table.  However,  the  advantage  function  may  be  unnormalized,  with 
large  values  in  one  state,  and  small  values  in  another  state.  An  unnormalized  advantage  function  can  be  as 
difficult  for  a  function  approximation  system  to  represent  as  a  Q  function,  for  similar  reasons.  Theorem  2 
indicates  that  the  normalizing  update  does  tend  to  put  the  advantage  function  into  normal  form.  Therefore  a 
sequence  containing  both  types  of  updates  may  converge  to  an  advantage  function  that  implies  an  optimal 
policy,  and  also  is  sufficiently  normalized  to  be  learned  easily  by  a  function  approximation  system.  It  appears 
possible  to  prove  convergence  for  the  full  advantage  updating  algorithm,  where  both  learning  and  normalizing 
updates  are  performed  infinitely  often  for  every  state-action  pair.  A  proof  of  this  convergence  result,  based  on 
the  results  of  Jaakkola,  Jordan,  and  Singh  (1993),  will  appear  in  a  forthcoming  paper. 
8. IMPLEMENTATION ISSUES


Many  optimal  control  problems  occurring  in  practice  have  continuous,  high-dimensional  state  and  action 
vectors.  This  suggests  that  the  V  and  A  functions  should  be  represented  with  general  function  approximation 
systems  that  learn  from  examples,  rather  than  using  lookup  tables.  Possible  systems  might  include  a  multilayer 
perceptron, a radial basis function network, a CMAC, or a memory-based learning system using
k-nearest-neighbor interpolation. Such systems can be trained by giving examples of the value of a function for various
inputs. 

Continuous-time advantage updating requires knowledge of the rate of change of value, V̇(xt) = dV(xt)/dt. If the
learning system is constantly calculating the value of V as the state changes, then simple filters and techniques
from adaptive control theory can be used to estimate V̇(xt) at a particular time t. In fact, the filter can even be
noncausal, using the values of V at times later than time t as well as at times earlier than time t in the calculation
of V̇(xt). It is also acceptable for the estimate of V̇(xt) to be somewhat noisy. As long as the noise has zero
mean and bounded variance, this should not prevent convergence of the advantage updating algorithm to the
correct policy, although noise would be expected to slow the convergence.
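
One simple noncausal filter of the kind described above is an average of symmetric differences around time t. The Python sketch below is only one possible choice, not the filter used in the report; the function name `estimate_value_rate`, the window parameter `half_width`, and the sampled trace are assumptions made for illustration.

```python
def estimate_value_rate(values, dt, i, half_width=1):
    """Noncausal estimate of dV/dt at sample index i.

    Averages the symmetric differences (V[i+k] - V[i-k]) / (2*k*dt)
    for k = 1..half_width, so samples both before and after index i are used.
    """
    if i - half_width < 0 or i + half_width >= len(values):
        raise ValueError("not enough samples on both sides of index i")
    slopes = [(values[i + k] - values[i - k]) / (2 * k * dt)
              for k in range(1, half_width + 1)]
    return sum(slopes) / len(slopes)

dt = 0.01
values = [(k * dt) ** 2 for k in range(101)]   # hypothetical trace V(t) = t^2
rate = estimate_value_rate(values, dt, i=50, half_width=3)
```

For this quadratic trace the symmetric difference is exact, so `rate` is close to the true derivative 2t = 1.0 at t = 0.5. With zero-mean noise on the samples, a wider window averages more of it out, at the cost of more delay before the estimate for time t is available.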

An  additional  issue  arises  when  the  action  vector  is  continuous.  All  forms  of  dynamic  programming  require 
the calculation on each time step of a maximum (or minimum, or minimax saddlepoint). In Q-learning, for an
update  of  a  single  parameter,  it  is  necessary  to  find  the  maximum  Q  value  in  a  particular  state.  Value  iteration 
and  policy  iteration  require  the  calculation  of  a  sum  or  integral  over  all  possible  state  transitions  for  a  given 
action.  This  calculation  must  be  repeated  for  each  possible  action  in  a  given  state,  and  the  maximum  of  the 
calculated  values  must  be  found  in  order  to  update  a  single  parameter.  In  advantage  updating,  for  a  single 
application  of  step  1  or  step  2  above,  the  maximum  advantage  in  a  given  state  must  be  found.  If  the  state  and 
action  vectors  are  continuous,  and  the  functions  are  stored  (for  example)  in  a  single-hidden-layer,  sigmoidal 
network,  then  it  is  difficult  to  find  the  action  that  maximizes  the  output  for  a  given  state.  There  are  three 
approaches  to  finding  this  maximum. 

The  first  approach  is  to  find  the  maximizing  action  through  traditional  search  techniques,  treating  the  stored 
function  as  an  unknown  function  to  be  sampled  repeatedly  while  trying  to  find  the  maximum.  This  can  be 
computationally  intensive  and  can  be  subject  to  problems  with  local  maxima,  especially  in  high-dimensional
action  spaces. 

The  second  approach  exploits  the  ability  of  advantage  updating  to  work  with  small  time  steps.  For  example, 
an MDP might have a time-step duration of Δt, a state vector x, and a scalar action u which is a real number
between zero and one. An almost-equivalent MDP is one with a time-step duration of Δt/100, a state vector
(x,u),  and  only  two  possible  actions:  increase  u  by  1/100,  or  decrease  u  by  1/100.  The  latter  MDP  is  almost 
equivalent  to  the  former;  given  an  optimal  policy  for  the  latter  MDP,  an  approximately  optimal  action  for  state  x 
in  the  former  MDP  can  be  found  through  at  most  100  evaluations  of  the  policy  for  the  latter  MDP.  In  the  limit, 
using  large  factors  instead  of  100,  this  approach  reduces  the  problem  of  continuous  actions  to  an  equivalent 
problem  with  discrete  actions.  The  algorithm  for  this  maximization  method  in  the  limit  is  given  in  Baird 
(1992). 
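
The recovery of an approximately optimal action from the two-action MDP can be sketched as follows. This Python example is a minimal illustration, not the algorithm of Baird (1992): the function `approx_optimal_action`, the convention that the policy returns +1 or -1, the choice to start from u = 0, and the toy policy are all assumptions. It also assumes the policy switches from "increase" to "decrease" only once as u grows.

```python
def approx_optimal_action(step_policy, x, n_steps=100):
    """Recover an approximately optimal u in [0, 1] for state x.

    step_policy(x, u) is the policy of the augmented MDP: it returns +1
    ("increase u by 1/n_steps") or -1 ("decrease u by 1/n_steps").
    Starting from u = 0, follow "increase" decisions until the policy first
    asks to decrease; this needs at most n_steps policy evaluations.
    """
    k = 0                                  # u is stored as k / n_steps
    for _ in range(n_steps):
        if step_policy(x, k / n_steps) < 0:
            break
        k += 1
    return k / n_steps

# Hypothetical augmented-MDP policy whose preferred action is u = 0.37 * x.
def toy_policy(x, u):
    return 1 if u < 0.37 * x else -1

u_star = approx_optimal_action(toy_policy, x=1.0)
```

The result is within one increment (1/100) of the action the toy policy prefers, which is the sense in which the two MDPs are "almost equivalent".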

The  third  approach  to  finding  maxima  of  learned  functions  is  the  wire  fitting  approach,  described  in  Baird 
and  Klopf  (1993b).  It  is  possible  to  take  any  function  approximation  system  and  embed  it  in  a  larger  system 
which  makes  trivial  the  problem  of  finding  the  global  maximum  for  each  state.  The  maximum  of  the  function  in 
a given state can be found in constant time. This approach appears general, and less computationally intensive
than the one in Baird (1992), and has been shown to work well on a simple cart-pole control problem.

The  advantage  updating  algorithm,  as  described  here,  was  designed  for  use  with  a  lookup  table.  It  has  been 
shown  (Baird  and  Harmon,  1994)  that  when  function-approximation  systems  are  combined  with  algorithms 
such as Q-learning or advantage updating, it can be useful to modify the learning algorithm to perform gradient
descent  on  the  mean  squared  Bellman  residual.  If  a  function  approximation  system  is  used  for  the  advantage 
and  value  functions,  and  if  the  system  being  controlled  is  deterministic,  then  a  given  weight  W  in  the  function- 
approximation  system  could  be  changed  according  to  equation  (39): 
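$$E = \left( \frac{R_{\Delta t}(x_t,u_t) + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t)}{\Delta t} - A(x_t,u_t) + \max_u A(x_t,u) \right)^2 + \left( \max_u A(x_t,u) \right)^2 \qquad (38)$$

$$\begin{aligned}
\Delta W &= -\frac{\alpha}{2}\,\frac{\partial E}{\partial W} \\
&= -\alpha \left( \frac{R_{\Delta t}(x_t,u_t) + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t)}{\Delta t} - A(x_t,u_t) + \max_u A(x_t,u) \right) \\
&\quad \cdot \left( \frac{\frac{\partial}{\partial W} R_{\Delta t}(x_t,u_t) + \gamma^{\Delta t} \frac{\partial}{\partial W} V(x_{t+\Delta t}) - \frac{\partial}{\partial W} V(x_t)}{\Delta t} - \frac{\partial}{\partial W} A(x_t,u_t) + \frac{\partial}{\partial W} \max_u A(x_t,u) \right) \\
&\quad - \alpha \max_u A(x_t,u)\, \frac{\partial}{\partial W} \max_u A(x_t,u)
\end{aligned} \qquad (39)$$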


where E is the sum of the two squared Bellman residuals associated with advantage updating. This form of
advantage updating can be described by the single equation (39), rather than requiring three different updates,
and it requires the choice of only a single learning rate α, rather than three rates α, β, and ω. As a simple
gradient-descent algorithm, it is also guaranteed to converge to the correct answer under reasonable
assumptions. However, if the system being controlled is nondeterministic, it is necessary to generate two
different possible "next states" xt+Δt for a given action ut performed in a given state xt. One xt+Δt must be used
to evaluate V(xt+Δt), and the other must be used to evaluate ∂V(xt+Δt)/∂W. This ensures that the weight change is
an unbiased estimator of the true Bellman-residual gradient, but requires a system such as in Dyna (Sutton,
1990a) to generate the next state. Simulation results for (39) are presented in Harmon, Baird, and Klopf (1994).
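
For intuition, update (39) can be written out for the simplest function approximator, a lookup table in which every entry is its own weight W. The Python sketch below is a hedged illustration, not the implementation used in the cited simulations. It assumes a deterministic system, treats the reinforcement R as independent of the weights (so its derivative is zero), and uses hypothetical table values and function names.

```python
def residuals(V, A, x, u, reward, x_next, dt, gamma):
    """Return the two Bellman residuals whose squares sum to E in (38)."""
    a_max = max(A[x])
    b1 = (reward + gamma ** dt * V[x_next] - V[x]) / dt - A[x][u] + a_max
    return b1, a_max

def residual_gradient_step(V, A, x, u, reward, x_next, dt, gamma, alpha):
    """One step of update (39) when every table entry is its own weight W."""
    b1, b2 = residuals(V, A, x, u, reward, x_next, dt, gamma)
    u_max = max(range(len(A[x])), key=A[x].__getitem__)
    # d(b1)/dW is gamma**dt/dt for V[x_next], -1/dt for V[x],
    # -1 for A[x][u], and +1 for the maximizing entry A[x][u_max].
    V[x_next] -= alpha * b1 * gamma ** dt / dt
    V[x] += alpha * b1 / dt
    A[x][u] += alpha * b1
    # The maximizing advantage also appears in the second residual.
    A[x][u_max] -= alpha * (b1 + b2)

V = [0.0, 1.0]                       # hypothetical two-state value table
A = [[0.5, -0.2], [0.0, 0.0]]        # hypothetical advantage table
args = dict(x=0, u=1, reward=-0.1, x_next=1, dt=0.1, gamma=0.9)
b1, b2 = residuals(V, A, **args)
error_before = b1 ** 2 + b2 ** 2
residual_gradient_step(V, A, alpha=0.001, **args)
b1, b2 = residuals(V, A, **args)
error_after = b1 ** 2 + b2 ** 2
```

With a small learning rate, a single step reduces E for the sampled transition, as expected of gradient descent. For a nondeterministic system, the `x_next` used inside `residuals` and the `x_next` whose value entry is adjusted would have to be two independently generated next states, as described above.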


Throughout this paper, it has been assumed that the advantage of an action is defined relative to the
maximum action in a given state. This simplifies the equations, but may lead to an optimal advantage function
that is not smooth. To avoid this problem, one could arbitrarily choose a given action in each state, uref, as a
reference action. The advantage for a given action would then be defined as the degree to which that action is
better than the reference action. With this modification, the advantage of the reference action,
Aref(x) = A(x,uref), would be forced to zero by normalization. Then Amax would be replaced with Aref in
updates (22) and (23), in equations (26) and (28), and in the last line of equation (39). Amax would be replaced
with (Amax − Aref) in updates (21) and (24), in equation (27), and in the rest of equation (39). A learning system
will typically spend a large proportion of the time following the policy, so it generally appears better to use
Amax as the reference rather than Aref, but it would be interesting to investigate the use of Aref instead.
9. CONCLUSION


Advantage updating is shown to learn slightly faster than Q-learning for problems with large time steps and
no noise, and far more quickly for problems with small time steps or noise. Advantage updating works in
continuous time, which Q-learning cannot do. Advantage updating also has better convergence properties than
R-learning, differential dynamic programming, or algorithms based on stored change in value or stored policies.
Complete  learning  systems  for  continuous  states,  actions,  and  time  can  be  built  using  this  algorithm  with 
existing  function  approximation  systems,  function  maximization  systems,  and  filter  systems.  Unlike  differential 
dynamic  programming  or  value  iteration,  it  is  possible  for  advantage  updating  to  learn  without  a  model.  If  a 
model  is  known  or  learned,  advantage  updating  may  be  combined  with  the  model  as  in  Dyna  (Sutton,  1990a).  If 
the  system  being  controlled  is  stochastic,  this  direct  method  combined  with  the  model  could  be  more  efficient 
than  an  indirect  method  combined  with  the  model.  This  is  due  to  the  fact  that  some  indirect  methods  require 
maximization  over  infinite  sets  of  integrals  in  order  to  accomplish  a  single  update,  whereas  advantage  updating 
can  accomplish  the  calculation  of  both  the  integrals  and  the  maximization  incrementally.  Future  work  will 
include  analysis  of  additional  convergence  issues,  and  application  of  advantage  updating  to  more  difficult 
problems. 
APPENDIX A: NOTATION


xt  State at time t.

ut  Control action at time t. In discrete-time control, action is constant throughout a time step.

rΔt(xt,ut)  Rate of reinforcement at time t while performing action ut in state xt.

RΔt(x,u)  Total discounted reinforcement during a single time step starting in state x with constant action u. R is the integral of r as time varies over a single time step.

ℛ(x,u)  Information stored by R-learning for the state-action pair (x,u). R values are not usually written in script, but a script ℛ is used here to distinguish R values from reinforcement.

π*(x)  Optimal control action to perform in state x.

V*(x)  Total discounted reinforcement over all time if starting in state x then acting optimally.

Q*(x,u)  Total discounted reinforcement over all time if starting in state x, doing u, then acting optimally.

ΔV*(x,u)  Expected value of V*(x′) − V*(x), where x′ is the state reached by performing action u in state x.

A*(x,u)  Amount by which action u is better than the optimal action in maximizing total discounted reinforcement over all time. A* is zero for optimal actions, negative for all other actions.

π, V, Q, A, ΔV  Learning system's estimates of π*, V*, Q*, A*, and ΔV*.

All parameter updates are represented by arrows. When a parameter is updated during learning, the notation:

$$W \leftarrow K \qquad (40)$$

represents the operation of instantaneously changing the parameter W so that its new value is K, whereas:

$$W \xleftarrow{\;\alpha\;} K \qquad (41)$$

represents a partial movement of the value of W toward K, which is equivalent to:

$$W \leftarrow (1-\alpha)\,W + \alpha K \qquad (42)$$

where the learning rate α is a small positive number.
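
The arrow notation corresponds to a one-line operation in code. The Python sketch below is illustrative only; the function name `soft_update` and the numbers are assumptions.

```python
def soft_update(w, target, alpha):
    """Move w part of the way toward target, as in (42)."""
    return (1 - alpha) * w + alpha * target

# Repeating the partial update (41) with a fixed target K = 5.0
# brings W arbitrarily close to K.
w = 0.0
for _ in range(200):
    w = soft_update(w, target=5.0, alpha=0.1)
```

Setting `alpha` to 1 recovers the instantaneous assignment (40).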
APPENDIX B: LQR CONSTANTS


For Δt ≠ 0 and γ ≠ 1, the following three equations give the constants, ki, for the optimal controller for the LQR
problem. If Δt = 0, or γ = 1, or both, the constants are calculated by evaluating the limit of the right side of the
equations as Δt goes to zero, or γ goes to one, or both:


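$$k_1 = \left( \frac{1-\gamma^{\Delta t}}{\Delta t} \right) \frac{2\gamma^{\Delta t} - 2\Delta t \ln\gamma - 2 - (1-\gamma^{\Delta t})\ln^2\gamma + \sqrt{(2+\ln^2\gamma)^2 (1-\gamma^{\Delta t})^2 - 4\Delta t^2 \gamma^{\Delta t} \ln^2\gamma}}{2 - 2\gamma^{2\Delta t} + 4\Delta t\,\gamma^{\Delta t}\ln\gamma + (1-\gamma^{2\Delta t})\ln^2\gamma + (1-\gamma^{\Delta t})\sqrt{(2+\ln^2\gamma)^2 (1-\gamma^{\Delta t})^2 - 4\Delta t^2 \gamma^{\Delta t} \ln^2\gamma}} \qquad (43)$$

$$k_2 = \frac{(2+\ln^2\gamma)(1-\gamma^{\Delta t})^2 - 2\Delta t^2 \gamma^{\Delta t}\ln^2\gamma - (1-\gamma^{\Delta t})\sqrt{(2+\ln^2\gamma)^2 (1-\gamma^{\Delta t})^2 - 4\Delta t^2 \gamma^{\Delta t} \ln^2\gamma}}{2\Delta t^2 \gamma^{\Delta t} \ln^3\gamma} \qquad (44)$$

$$k_3 = \frac{(2+\ln^2\gamma)(\gamma^{2\Delta t}-1) - 4\Delta t\,\gamma^{\Delta t}\ln\gamma - (1-\gamma^{\Delta t})\sqrt{(2+\ln^2\gamma)^2 (1-\gamma^{\Delta t})^2 - 4\Delta t^2 \gamma^{\Delta t} \ln^2\gamma}}{2\Delta t \ln^3\gamma} \qquad (45)$$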


The validity of these equations can be verified by substituting the equations for ki into the equations for π*, V*,
Q*, and A*, then substituting those equations into the Bellman equations to check that they are satisfied. The
following Mathematica code calculates all the functions, and verifies that the given equations for the constants
do lead to functions V* and A* that satisfy equation (25). The last line prints the difference between the two
sides of equation (25) as a function of x, u, Δt, and γ.

(********** check validity of functions and constants ***********)
(* g is the discount factor gamma; dt is the time-step duration *)
s[g_,dt_] := Sqrt[(2+Log[g]^2)^2 (1-g^dt)^2 - 4 dt^2 g^dt*Log[g]^2]
k1[g_,dt_] := (1-g^dt)/dt*(2 g^dt - 2 dt*Log[g] - 2 - (1-g^dt) Log[g]^2 + s[g,dt]) /
    (2 - 2 g^(2 dt) + 4 dt*g^dt*Log[g] + (1-g^(2 dt)) Log[g]^2 + (1-g^dt) s[g,dt])
k2[g_,dt_] := ((2+Log[g]^2) (1-g^dt)^2 - 2 dt^2 g^dt*Log[g]^2 - (1-g^dt) s[g,dt]) /
    (2 dt^2 g^dt*Log[g]^3)
k3[g_,dt_] := ((2+Log[g]^2) (g^(2 dt)-1) - 4 dt*g^dt*Log[g] - (1-g^dt) s[g,dt]) /
    (2 dt*Log[g]^3)
(* reinforcement rate r and its discounted integral R over one time step *)
r[x_,u_,g_,dt_] := -x^2-u^2
R[x_,u_,g_,dt_] := Integrate[g^t*r[x+t*u,u,g,dt], {t,0,dt}]
V[x_,g_,dt_] := -k2[g,dt]*x^2
A[x_,u_,g_,dt_] := -k3[g,dt]*(k1[g,dt]*x+u)^2
pi[x_,g_,dt_] := -k1[g,dt]*x
Q[x_,u_,g_,dt_] := V[x,g,dt]+dt*A[x,u,g,dt]
(* difference between the two sides of equation (25) *)
Together[g^dt*V[x+dt*u,g,dt]-V[x,g,dt]+R[x,u,g,dt]-dt*A[x,u,g,dt]]


The code prints the number zero; therefore, the equation is satisfied for all values of x, u, Δt, and γ. It is clear
that A* is nonpositive everywhere, and is zero when following the policy π*. Therefore, the functions A*, V*,
and π* are also correct. It is also clear that Q* = V* + A*Δt; therefore, Q* is correct. The following code finds the
constants for several special cases, as well as the general formula for R:
(********** special values of R and k ***********)
R[x,u,g,dt]
Limit[R[x,u,g,dt], g->1]
Limit[k1[g,dt], g->1]
Limit[k2[g,dt], g->1]
Limit[k3[g,dt], g->1]
Limit[k1[g,dt], dt->0]
Limit[k2[g,dt], dt->0]
Limit[k3[g,dt], dt->0]
Limit[Limit[k1[g,dt], g->1], dt->0]
Limit[Limit[k2[g,dt], g->1], dt->0]
Limit[Limit[k3[g,dt], g->1], dt->0]

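This finds that the total discounted reinforcement received during a single time step of duration Δt, starting in
state x, with a constant control action of u throughout the time step, is:

$$R(x,u) = \frac{2(1-\gamma^{\Delta t})u^2 + 2u\ln\gamma\,\bigl(u\,\Delta t\,\gamma^{\Delta t} - (1-\gamma^{\Delta t})x\bigr) + \ln^2\gamma\,\bigl((1-\gamma^{\Delta t}-\Delta t^2\gamma^{\Delta t})u^2 + (1-\gamma^{\Delta t})x^2 - 2\Delta t\,\gamma^{\Delta t}ux\bigr)}{\ln^3\gamma} \qquad (46)$$

If γ = 1, then R reduces to:

$$R(x,u) = -\Delta t\,\bigl(u^2 + x^2 + \Delta t\,xu + \Delta t^2 u^2/3\bigr) \qquad (47)$$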

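For no discounting (γ = 1), the constants are:

$$k_1 = \frac{3\Delta t + 6\sqrt{1+\Delta t^2/12}}{6 + 2\Delta t^2 + 6\Delta t\sqrt{1+\Delta t^2/12}} \qquad (48)$$

$$k_2 = \sqrt{1+\Delta t^2/12} \qquad (49)$$

$$k_3 = 1 + \Delta t^2/3 + \Delta t\sqrt{1+\Delta t^2/12} \qquad (50)$$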

For continuous time (Δt = 0), the constants are:

$$k_1 = k_2 = \frac{\ln\gamma + \sqrt{4+\ln^2\gamma}}{2} \qquad (51)$$

$$k_3 = 1 \qquad (52)$$

For continuous time with no discounting (Δt = 0, γ = 1), the constants reduce to:

$$k_1 = k_2 = k_3 = 1 \qquad (53)$$
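
The symbolic check described in this appendix can also be repeated numerically. The Python sketch below evaluates the constants from (43)-(45) and the closed form (46), then computes the same difference that the last line of the first Mathematica listing prints. It is a spot check at a few points, not a proof; the function names and the sample values of x, u, γ, and Δt are assumptions chosen for illustration, and the denominator of k3 follows the Mathematica listing.

```python
import math

def lqr_constants(gamma, dt):
    """Constants k1, k2, k3 from (43)-(45); requires gamma != 1 and dt != 0."""
    L = math.log(gamma)
    G = gamma ** dt
    s = math.sqrt((2 + L**2) ** 2 * (1 - G) ** 2 - 4 * dt**2 * G * L**2)
    k1 = ((1 - G) / dt) * (2 * G - 2 * dt * L - 2 - (1 - G) * L**2 + s) / (
        2 - 2 * G**2 + 4 * dt * G * L + (1 - G**2) * L**2 + (1 - G) * s)
    k2 = ((2 + L**2) * (1 - G) ** 2 - 2 * dt**2 * G * L**2 - (1 - G) * s) / (
        2 * dt**2 * G * L**3)
    k3 = ((2 + L**2) * (G**2 - 1) - 4 * dt * G * L - (1 - G) * s) / (
        2 * dt * L**3)
    return k1, k2, k3

def step_reinforcement(x, u, gamma, dt):
    """Closed form (46) for the discounted reinforcement over one time step."""
    L = math.log(gamma)
    G = gamma ** dt
    return (2 * (1 - G) * u**2
            + 2 * u * L * (u * dt * G - (1 - G) * x)
            + L**2 * ((1 - G - dt**2 * G) * u**2 + (1 - G) * x**2
                      - 2 * dt * G * u * x)) / L**3

def bellman_gap(x, u, gamma, dt):
    """gamma^dt V(x + dt u) - V(x) + R(x,u) - dt A(x,u); should be near zero."""
    k1, k2, k3 = lqr_constants(gamma, dt)
    value = lambda state: -k2 * state**2
    advantage = -k3 * (k1 * x + u) ** 2
    return (gamma ** dt * value(x + dt * u) - value(x)
            + step_reinforcement(x, u, gamma, dt) - dt * advantage)

gaps = [bellman_gap(x, u, gamma=0.9, dt=0.5)
        for x in (-1.0, 0.3, 2.0) for u in (-0.7, 0.0, 1.5)]
```

Every entry of `gaps` should be zero up to floating-point rounding. Very small Δt or γ very close to one make the formulas numerically ill-conditioned, because they subtract nearly equal quantities; in those regimes the limiting forms (48)-(53) are the safer choice.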